Unsupervised Learning and Dimensionality Reduction

1. Objective:

The purpose of this project is to analyze and compare unsupervised learning algorithms by applying them to two different classification problems. Two types of unsupervised learning algorithms analyzed in this project are Clustering Algorithms and Dimensionality Reduction algorithms.

2. Implementation Approach:

Analysis will be carried out by implementing following steps on the two datasets -

Step 1: K means and GaussianMixtureModel clustering algorithms will be executed on selected datasets. Metrics such as SSE, Silhouette Co-efficients, Adjusted Mutual Information etc. will be compared to identified optimal cluster.
Step 2: PCA, FastICA, Randomized Projections and Random Forest Classifier will be applied on two datasets to conduct dimensionality reduction experiments.
Step 3: Both clustering algorithms will be executed on the datasets which were reduced using above four dimensionality reduction techniques.
Step 4: Neural Networks will be implemented on one of the datasets on which four dimensionality reduction techniques are implemented in Step 2 and results will be analyzed.
Step 5: In this step Neural Networks will be implemented on one of the datasets on which two clustering algorithms are implemented and results will be analyzed.

3. Datasets

3.1 Wine Quality Prediction

Problem: Given the physicochemical test results of a wine, predict whether the quality is wine is above average.

Dataset Details: UCI Wine Quality dataset consists physicochemical and sensory data red and white variants of the Portuguese "Vinho Verde" wine. Dataset is available at https://archive.ics.uci.edu/ml/datasets/Wine+Quality

Number of Instances: red wine - 1599; white wine - 4898.
Number of Attributes: 11 + output attribute

Why is this problem interesting?
Domain: Predicting the quality of the wine is crucial while determining its price. It is important to correctly identify above-average wines and hence classifiers must lower False Positive rate. It is acceptable if the classifier has a slightly higher False Negative rate. It will be interesting to see how the classifier implements this constraint.
Dataset: The target variable is the multi-class categorical variable and the only couple of classes are dominant. All the independent variables are numerical and there are no missing values in the dataset. Only a few features are slightly skewed. It will be interesting to see whether supervised learning algorithms show any bias/variance on such a clean dataset.
Observations from Supervised Learning algorithms: When supervised learning algorithms were executed on this dataset it was observed that, algorithms were quickly overfitting possible due to large number of attributes and less instances. It will be interesting to observe how reducing dimensions will impact accuracy. Also, original dataset has 9 classes whereas classes were reduced to 2 for supervised learning. It will also interesting to see how clusters algorithms create from this dataset.

3.2 Income Prediction

Problem: Predict whether the income of the individual exceeds 50k a year based on the individual's census data.

Dataset Details: Dataset was extracted by Barry Becker from the 1994 Census database. A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1)&& (HRSWK>0)). Dataset is available at https://archive.ics.uci.edu/ml/datasets/Census+Income

Number of Instances: 48842
Number of Attributes: 14 + output attribute

Why is this problem interesting?
Domain: Predicting the income of an individual has many practical applications across the financial industry. For example, banks can decide whether to approve a loan based on the predicted income of an individual. In such scenarios, it is crucial to lower False Positive rate and it may be acceptable to have a slightly higher False Negative rate. It will be interesting to see how various classifiers implement these constraints.
Dataset: High-level data exploration revealed that many of the independent features are categorical with many classes and some of these are qualitative while others are quantitative. Few of the features are highly skewed. It will be interesting to see how these features impact the performance of supervised algorithms. I also notice the missing values in some of the instances. I will also be interesting to see whether imputing or removing these instances will create any bias in the model.
Observations from Supervised Learning algorithms: During implementation of supervised learning algorithms it was observed that computation time of many algorithm was high. Multi-collinearity was also observed in this dataset. It will be interesting to observe how reducing dimensions will impact accuracy of the supervised learning algorithms..

In [1]:
#Packages Used
import itertools
import pandas as pd
import numpy as np
import utils
import time
import matplotlib.pyplot as plt
import matplotlib as mpl
import matplotlib.cm as cm
from scipy import linalg
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score, adjusted_mutual_info_score, adjusted_rand_score, homogeneity_completeness_v_measure
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.decomposition import PCA
from sklearn.decomposition import FastICA
from sklearn.random_projection import GaussianRandomProjection, SparseRandomProjection
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.neural_network import MLPClassifier

from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report 
from sklearn.metrics import average_precision_score, precision_recall_curve

from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import StratifiedKFold, train_test_split, cross_val_predict, cross_val_score, GridSearchCV
%matplotlib inline

4. Data Exploration and Pre-processing

4.1 Wine Quality Prediction

In [2]:
#load red wine dataset
dataset_loc_red = "./datasets/wine/winequality-red.csv"
df_red = pd.read_csv(dataset_loc_red, sep=";")

#load white wine dataset
dataset_loc_white = "./datasets/wine/winequality-white.csv"
df_white = pd.read_csv(dataset_loc_white, sep=";")

#combine both datasets
df_combined = df_red.append(df_white, ignore_index=True)

# check for missing values
df_combined.isnull().values.any()
Out[2]:
False
In [3]:
# check for feature data types
df_combined.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6497 entries, 0 to 6496
Data columns (total 12 columns):
fixed acidity           6497 non-null float64
volatile acidity        6497 non-null float64
citric acid             6497 non-null float64
residual sugar          6497 non-null float64
chlorides               6497 non-null float64
free sulfur dioxide     6497 non-null float64
total sulfur dioxide    6497 non-null float64
density                 6497 non-null float64
pH                      6497 non-null float64
sulphates               6497 non-null float64
alcohol                 6497 non-null float64
quality                 6497 non-null int64
dtypes: float64(11), int64(1)
memory usage: 609.2 KB
In [4]:
#Analyze Features
df_combined.describe()
Out[4]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
count 6497.000000 6497.000000 6497.000000 6497.000000 6497.000000 6497.000000 6497.000000 6497.000000 6497.000000 6497.000000 6497.000000 6497.000000
mean 7.215307 0.339666 0.318633 5.443235 0.056034 30.525319 115.744574 0.994697 3.218501 0.531268 10.491801 5.818378
std 1.296434 0.164636 0.145318 4.757804 0.035034 17.749400 56.521855 0.002999 0.160787 0.148806 1.192712 0.873255
min 3.800000 0.080000 0.000000 0.600000 0.009000 1.000000 6.000000 0.987110 2.720000 0.220000 8.000000 3.000000
25% 6.400000 0.230000 0.250000 1.800000 0.038000 17.000000 77.000000 0.992340 3.110000 0.430000 9.500000 5.000000
50% 7.000000 0.290000 0.310000 3.000000 0.047000 29.000000 118.000000 0.994890 3.210000 0.510000 10.300000 6.000000
75% 7.700000 0.400000 0.390000 8.100000 0.065000 41.000000 156.000000 0.996990 3.320000 0.600000 11.300000 6.000000
max 15.900000 1.580000 1.660000 65.800000 0.611000 289.000000 440.000000 1.038980 4.010000 2.000000 14.900000 9.000000
In [5]:
#Ploting the class distribution
sns.countplot(x='quality', data = df_combined);
In [6]:
#plot correlation heat map
plt.figure(figsize =(10,10))
sns.heatmap(df_combined.corr(),annot=True)
plt.show()
In [7]:
# Categorize wine into "above average" and "below average" wines
df_combined["quality"].values[df_combined["quality"] <= 5] = 0
df_combined["quality"].values[df_combined["quality"] > 5] = 1
In [8]:
Wine_X = df_combined.iloc[:,:-1]
Wine_y = df_combined.iloc[:,-1]
In [9]:
Wine_X_Scaled = StandardScaler().fit_transform(Wine_X)
In [10]:
# Split the data into a training set and a test set
Wine_X_train, Wine_X_test, Wine_y_train, Wine_y_test = train_test_split(Wine_X_Scaled, Wine_y, random_state=1, test_size=0.30)

4.2 Income Prediction

In [11]:
#load census dataset
colnames=['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income'] 
dataset_loc = "./datasets/census/adult.data"
df_census = pd.read_csv(dataset_loc, names=colnames, header=None, skipinitialspace=True)

# check for missing values
df_census.isnull().values.any()
Out[11]:
False
In [12]:
# check for feature data types
df_census.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
age               32561 non-null int64
workclass         32561 non-null object
fnlwgt            32561 non-null int64
education         32561 non-null object
education-num     32561 non-null int64
marital-status    32561 non-null object
occupation        32561 non-null object
relationship      32561 non-null object
race              32561 non-null object
sex               32561 non-null object
capital-gain      32561 non-null int64
capital-loss      32561 non-null int64
hours-per-week    32561 non-null int64
native-country    32561 non-null object
income            32561 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
In [13]:
#Analyze Features
df_census.describe()
Out[13]:
age fnlwgt education-num capital-gain capital-loss hours-per-week
count 32561.000000 3.256100e+04 32561.000000 32561.000000 32561.000000 32561.000000
mean 38.581647 1.897784e+05 10.080679 1077.648844 87.303830 40.437456
std 13.640433 1.055500e+05 2.572720 7385.292085 402.960219 12.347429
min 17.000000 1.228500e+04 1.000000 0.000000 0.000000 1.000000
25% 28.000000 1.178270e+05 9.000000 0.000000 0.000000 40.000000
50% 37.000000 1.783560e+05 10.000000 0.000000 0.000000 40.000000
75% 48.000000 2.370510e+05 12.000000 0.000000 0.000000 45.000000
max 90.000000 1.484705e+06 16.000000 99999.000000 4356.000000 99.000000
In [14]:
# Replace missing values "?" with NAs 
df_census.replace('?', np.nan, inplace=True)
df_census.dropna()
Out[14]:
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country income
0 39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States <=50K
1 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States <=50K
2 38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K
3 53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K
4 28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba <=50K
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
32556 27 Private 257302 Assoc-acdm 12 Married-civ-spouse Tech-support Wife White Female 0 0 38 United-States <=50K
32557 40 Private 154374 HS-grad 9 Married-civ-spouse Machine-op-inspct Husband White Male 0 0 40 United-States >50K
32558 58 Private 151910 HS-grad 9 Widowed Adm-clerical Unmarried White Female 0 0 40 United-States <=50K
32559 22 Private 201490 HS-grad 9 Never-married Adm-clerical Own-child White Male 0 0 20 United-States <=50K
32560 52 Self-emp-inc 287927 HS-grad 9 Married-civ-spouse Exec-managerial Wife White Female 15024 0 40 United-States >50K

30162 rows × 15 columns

In [15]:
df_census.dropna(axis=0, inplace=True)
In [16]:
# converting target variable.
# df_census['income'] = df_census['income'].apply(lambda x: 1 if x=='>50K' else 0)
df_census["income"].values[df_census["income"] == '<=50K'] = 0
df_census["income"].values[df_census["income"] == '>50K'] = 1
In [17]:
#Feature Encoding
le = preprocessing.LabelEncoder()

for i in range(0,df_census.shape[1]):
    if df_census.dtypes[i]=='object':
        df_census[df_census.columns[i]] = le.fit_transform(df_census[df_census.columns[i]])
In [18]:
#Ploting the class distribution
sns.countplot(x='income', data = df_census);
In [19]:
Census_X = df_census.iloc[:,:-1]
Census_y = df_census.iloc[:,-1]
In [20]:
Census_X_Scaled = StandardScaler().fit_transform(Census_X)
In [21]:
# Split the data into a training set and a test set
Census_X_train, Census_X_test, Census_y_train, Census_y_test = train_test_split(Census_X_Scaled, Census_y, random_state=1, test_size=0.30)

5. K-Means and Expectation Maximization

K-Means is a popular unsupervised machine learning algorithm which clusters data points together based on certain similarities. K refers to the number of desired clusters. Data point is included in a cluster if it is closer to that cluster's centroid than centroid of any other cluster. K-Means algorithm can converge to a local minimum. K-Means finds the best centroids by assigning data points to clusters based on the current centroids and then choosing centroids based on the current assignment of data points to clusters.

Distance Measure: Choice of the distance measure will greatly influence the performance of the K-Means and Expectation Maximization algorithms. There is no free lunch i.e. the nest distance measure for all algorithms and choice of distance measure depends on the dataset and problem type. Since both of out problems are binary classification and we have clean and compact dataset with minimal outliers we will use Euclidean distance. Mahala Nobis measure is computationally expensive whereas Cosine measure is suitable for document similarity hence they are not used. Manhattan distance generally works only if the points are arranged in the form of a grid and it ignores geometric distance. In K-Means we are minimizing sum of squares of distances which will increase Euclidean distance. If we use Manhattan distance, we cannot prove that decreasing sum of squares of distances will give us max Manhattan distance. In this project both Kmeans and EM algorithms are analyzed on the K value between 2 and 14.

Although clustering and dimensionality reduction are unsupervised learning methods which do not require labelled data and test dataset, dataset was split into test and train for future use to analyze performance of neural network. For clustering and dimensionality reduction experiment in next steps I used only training set for both datasets.

5.1 K-Means

5.1.1 Wine Quality Prediction

In [22]:
model_stats = []
kclusters = range(2, 15)

for k in kclusters:
    model = KMeans(n_clusters=k)
    t_start = time.time()*1000.0
    model.fit(Wine_X_Scaled)
    predcls = model.predict(Wine_X_Scaled)
    t_stop = time.time()*1000.0
    exectime = t_stop - t_start
    modelid = "kmeans"
    centroids = model.cluster_centers_
    sse = model.inertia_
    adjmutinfo = adjusted_mutual_info_score(Wine_y,predcls)
    adjrandinfo = adjusted_rand_score(Wine_y,predcls)
    homogene, complete, vmeasure = homogeneity_completeness_v_measure(Wine_y,predcls)

    model_stats.append([modelid, k, predcls, sse, centroids, adjmutinfo, adjrandinfo, homogene, complete, vmeasure, exectime])

dfstatsKM = pd.DataFrame(model_stats, columns=['model_id', 'k', 'predcls', 'sse', 'centroids', 'adjmutinfo', 'adjrandinfo', 'homogene', 'complete', 'vmeasure', 'exectime'])

5.1.2 Income Prediction

In [23]:
model_stats = []
kclusters = range(2, 15)

for k in kclusters:
    model = KMeans(n_clusters=k)
    t_start = time.time()*1000.0
    model.fit(Census_X_Scaled)
    predcls = model.predict(Census_X_Scaled)
    t_stop = time.time()*1000.0
    exectime = t_stop - t_start
    modelid = "kmeans"
    centroids = model.cluster_centers_
    sse = model.inertia_
    adjmutinfo = adjusted_mutual_info_score(Census_y,predcls)
    adjrandinfo = adjusted_rand_score(Census_y,predcls)
    homogene, complete, vmeasure = homogeneity_completeness_v_measure(Census_y,predcls)

    model_stats.append([modelid, k, predcls, sse, centroids, adjmutinfo, adjrandinfo, homogene, complete, vmeasure, exectime])

dfstatsCensusKM = pd.DataFrame(model_stats, columns=['model_id', 'k', 'predcls', 'sse', 'centroids', 'adjmutinfo', 'adjrandinfo', 'homogene', 'complete', 'vmeasure', 'exectime'])
In [24]:
# dfstatsCensusKM

5.1.3 Elbow Analysis

Algorithm was executed multiple times with increasing number of cluster and then chart was plotted for within-cluster distances against cluster count. It was expected that within-cluster distance will fall sharply for initial cluster counts but curve will flatten out creating a elbow pattern. Optimal value of K is the point where elbow bends and within-cluster distance stops following sharply.

In [25]:
plt.figure(figsize=(6, 6))
plt.plot(dfstatsKM['k'],dfstatsKM['sse'], '-o')
plt.xlabel('Number of clusters')
plt.ylabel('Sum of squared distance')
plt.minorticks_on()
plt.grid(b=True, which='major', color='k', linestyle='-', alpha=0.1)
plt.grid(b=True, which='minor', color='r', linestyle='-', alpha=0.05)
plt.title("Wine - Sum of Squared Distance")
plt.show()
In [26]:
plt.figure(figsize=(6, 6))
plt.plot(dfstatsCensusKM['k'],dfstatsCensusKM['sse'], '-o')
plt.xlabel('Number of clusters')
plt.ylabel('Sum of squared distance')
plt.minorticks_on()
plt.grid(b=True, which='major', color='k', linestyle='-', alpha=0.1)
plt.grid(b=True, which='minor', color='r', linestyle='-', alpha=0.05)
plt.title("Census - Sum of Squared Distance")
plt.show()

Observation: As number of clusters increases, sum of squared distances decreases which leads to more generalized clusters at the lower value of K. It can be observed that for both datasets the elbow is not prominent. Although within-cluster distance falls sharply but it continues to fall as number of clusters increases. Both graphs do not give us clear value of K and hence Silhouette analysis was carried out.

5.1.4 Silhouette Analysis

Silhouette co-efficient measures how similar an object is to its own cluster as compared to other clusters. As compared to elbow method Silhouette chart provide clear representation of clusters and how well each object is classified. A Silhouette co-efficient of +1 indicates that the object closes to its own cluster and far away from neighboring cluster. Whereas value of -1 indicates that the object is far from its own cluster and close to neighboring cluster. A value of 0 suggest that object is at the boundary of the of two clusters. We prefer higher value of Silhouette co-efficient.

In [27]:
# Below code to plot Silhouette chart was adapted from scikit learn website.
# https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html

for n_clusters in range(2, 15):
    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)

    # The 1st subplot is the silhouette plot
    # The silhouette coefficient can range from -1, 1 but in this example all
    # lie within [-0.1, 1]
    ax1.set_xlim([-0.1, 1])
    # The (n_clusters+1)*10 is for inserting blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(Wine_X_Scaled) + (n_clusters + 1) * 10])

    # Initialize the clusterer with n_clusters value and a random generator
    # seed of 10 for reproducibility.
    clusterer = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = clusterer.fit_predict(Wine_X_Scaled)

    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(Wine_X_Scaled, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)

    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(Wine_X_Scaled, cluster_labels)

    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels == i]

        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")

    # The vertical line for average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

    # 2nd Plot showing the actual clusters formed
    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
    ax2.scatter(Wine_X_Scaled[:, 0], Wine_X_Scaled[:, 1], marker='.', s=30, lw=0, alpha=0.7,
                c=colors, edgecolor='k')

    # Labeling the clusters
    centers = clusterer.cluster_centers_
    # Draw white circles at cluster centers
    ax2.scatter(centers[:, 0], centers[:, 1], marker='o',
                c="white", alpha=1, s=200, edgecolor='k')

    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1,
                    s=50, edgecolor='k')

    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature")

    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')

plt.show()
For n_clusters = 2 The average silhouette_score is : 0.27647857692015093
For n_clusters = 3 The average silhouette_score is : 0.2350697475913732
For n_clusters = 4 The average silhouette_score is : 0.2475267723264805
For n_clusters = 5 The average silhouette_score is : 0.18050631736331987
For n_clusters = 6 The average silhouette_score is : 0.18659968985248593
For n_clusters = 7 The average silhouette_score is : 0.17954046503516125
For n_clusters = 8 The average silhouette_score is : 0.16059650486968527
For n_clusters = 9 The average silhouette_score is : 0.1492492306283983
For n_clusters = 10 The average silhouette_score is : 0.14500887727866757
For n_clusters = 11 The average silhouette_score is : 0.1409414603610983
For n_clusters = 12 The average silhouette_score is : 0.13256495730023238
For n_clusters = 13 The average silhouette_score is : 0.1331782929784289
For n_clusters = 14 The average silhouette_score is : 0.1440680243301704

Observations: As wine dataset is a labeled dataset with labels ranging from 1 to 9, I expected 7-9 clusters to give the highest Silhouette score. It was observed that highest Silhouette score of 0.509 was achieved with two clusters whereas Silhouette score for cluster 7 and above was very low. Analysis of scatter plot for first and second features indicate that objects are less separable as value of K increases. Even when k is 2 points of first and second features are not clearly separable which is also indicated by low average Silhouette score.

In [28]:
# Below code to plot Silhouette chart was adapted from scikit learn website.
# https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html
    
for n_clusters in range(2, 15):
    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)

    # The 1st subplot is the silhouette plot
    # The silhouette coefficient can range from -1, 1 but in this example all
    # lie within [-0.1, 1]
    ax1.set_xlim([-0.1, 1])
    # The (n_clusters+1)*10 is for inserting blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(Census_X_Scaled) + (n_clusters + 1) * 10])

    # Initialize the clusterer with n_clusters value and a random generator
    # seed of 10 for reproducibility.
    clusterer = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = clusterer.fit_predict(Census_X_Scaled)

    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(Census_X_Scaled, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)

    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(Census_X_Scaled, cluster_labels)

    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels == i]

        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")

    # The vertical line for average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

    # 2nd Plot showing the actual clusters formed
    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
    ax2.scatter(Census_X_Scaled[:, 0], Census_X_Scaled[:, 1], marker='.', s=30, lw=0, alpha=0.7,
                c=colors, edgecolor='k')

    # Labeling the clusters
    centers = clusterer.cluster_centers_
    # Draw white circles at cluster centers
    ax2.scatter(centers[:, 0], centers[:, 1], marker='o',
                c="white", alpha=1, s=200, edgecolor='k')

    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1,
                    s=50, edgecolor='k')

    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature")

    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')

plt.show()
For n_clusters = 2 The average silhouette_score is : 0.13633160289830093
For n_clusters = 3 The average silhouette_score is : 0.15156180028205937
For n_clusters = 4 The average silhouette_score is : 0.1475978899089203
For n_clusters = 5 The average silhouette_score is : 0.15453496758453636
For n_clusters = 6 The average silhouette_score is : 0.15317333316738516
For n_clusters = 7 The average silhouette_score is : 0.170460789995345
For n_clusters = 8 The average silhouette_score is : 0.16116897428698124
For n_clusters = 9 The average silhouette_score is : 0.15954892500788456
For n_clusters = 10 The average silhouette_score is : 0.14003819644769538
For n_clusters = 11 The average silhouette_score is : 0.1285409757186093
For n_clusters = 12 The average silhouette_score is : 0.13732632520020158
For n_clusters = 13 The average silhouette_score is : 0.1373796845117717
For n_clusters = 14 The average silhouette_score is : 0.1420346101608728

Observations: Census dataset is a binary dataset, I expected 2 clusters to give the highest Silhouette score. It was observed that highest Silhouette score of 0.584 was achieved with two clusters and average score dropped as number of clusters increased. Scatter plot shows first and second features which are both categorical. Negative silhouette scores are observed for both datasets at high value of K which indicates overlapping clusters. Similarly, some of clusters at higher value of k has thin which indicate some feature have very less impact. It was also observed that Silhouette analysis is computationally expensive as compared to other validation methods.

5.2 Expectation Maximization

K-means is hard clustering method in which each data point is assigned to one and only one cluster. On the other hand Expectation Maximization is a soft clustering method which estimates the parameters by maximizing the log-likelihood of the observed data point. This allows EM to assign data points to more than one cluster based on the posterior probabilities of the data points. Like K-means, this algorithm may converge to local optima but if there is a single global optimum this algorithm is guaranteed to global optima. Expectation Maximization was implemented using GaussianMixture model function in sklearn.

5.2.1 Wine Quality Prediction

In [29]:
model_stats = []
kclusters = range(2, 15)

for k in kclusters:
    model = GaussianMixture(n_components=k)
    t_start = time.time()*1000.0
    model.fit(Wine_X_Scaled)
    predcls = model.predict(Wine_X_Scaled)
    t_stop = time.time()*1000.0
    exectime = t_stop - t_start
    modelid = "gmm"
    centroids = model.means_
    loglike = model.score(Wine_X_Scaled)
    silscores = silhouette_samples(Wine_X_Scaled, predcls)
    adjmutinfo = adjusted_mutual_info_score(Wine_y,predcls)
    adjrandinfo = adjusted_rand_score(Wine_y,predcls)
    homogene, complete, vmeasure = homogeneity_completeness_v_measure(Wine_y,predcls)
    aic = model.aic(Wine_X_Scaled)
    bic = model.bic(Wine_X_Scaled)
    model_stats.append([modelid, k, predcls, loglike, silscores, centroids, adjmutinfo, adjrandinfo, homogene, complete, vmeasure, aic, bic, exectime])

dfstatsEM = pd.DataFrame(model_stats, columns=['model_id', 'k', 'predcls', 'loglike', 'silscore', 'centroids', 'adjmutinfo', 'adjrandinfo', 'homogene', 'complete', 'vmeasure', 'aic', 'bic', 'exectime'])

5.2.2 Income Prediction

In [30]:
model_stats = []
kclusters = range(2, 15)

for k in kclusters:
    model = GaussianMixture(n_components=k, covariance_type='diag')
    t_start = time.time()*1000.0
    model.fit(Census_X_Scaled)
    predcls = model.predict(Census_X_Scaled)
    t_stop = time.time()*1000.0
    exectime = t_stop - t_start
    modelid = "gmm"
    centroids = model.means_
    loglike = model.score(Census_X_Scaled)
    silscores = silhouette_samples(Census_X_Scaled, predcls)
    adjmutinfo = adjusted_mutual_info_score(Census_y,predcls)
    adjrandinfo = adjusted_rand_score(Census_y,predcls)
    homogene, complete, vmeasure = homogeneity_completeness_v_measure(Census_y,predcls)
    aic = model.aic(Census_X_Scaled)
    bic = model.bic(Census_X_Scaled)
    model_stats.append([modelid, k, predcls, loglike, silscores, centroids, adjmutinfo, adjrandinfo, homogene, complete, vmeasure, aic, bic, exectime])

dfstatsCensusEM = pd.DataFrame(model_stats, columns=['model_id', 'k', 'predcls', 'loglike', 'silscore', 'centroids', 'adjmutinfo', 'adjrandinfo', 'homogene', 'complete', 'vmeasure', 'aic', 'bic', 'exectime'])

5.2.3 BIC Analysis

BIC (Bayesian Information Criterion) is used to carry our analysis of Expectation Maximization algorithms which yield optimal value of K. As compared to AIC, BIC tends to have heavier penalty for additional parameters. BIC was chosen to evaluate EM models because we want to be more stringent in picking feature and avoid overfitting. AIC might be a better choice for dataset exploration.

In [31]:
#Code for below plot was adaoted from sklearn website
#https://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_selection.html

lowest_bic = np.infty
bic = []
n_components_range = range(2, 15)
cv_types = ['spherical', 'tied', 'diag', 'full']
for cv_type in cv_types:
    for n_components in n_components_range:
        # Fit a Gaussian mixture with EM
        gmm = GaussianMixture(n_components=n_components,
                                      covariance_type=cv_type)
        gmm.fit(Wine_X_Scaled)
        bic.append(gmm.bic(Wine_X_Scaled))
        if bic[-1] < lowest_bic:
            lowest_bic = bic[-1]
            best_gmm = gmm

bic = np.array(bic)
color_iter = itertools.cycle(['navy', 'turquoise', 'cornflowerblue',
                              'darkorange'])
clf = best_gmm
bars = []

# Plot the BIC scores
plt.figure(figsize=(8, 6))
spl = plt.subplot(2, 1, 1)
for i, (cv_type, color) in enumerate(zip(cv_types, color_iter)):
    xpos = np.array(n_components_range) + .2 * (i - 2)
    bars.append(plt.bar(xpos, bic[i * len(n_components_range):
                                  (i + 1) * len(n_components_range)],
                        width=.2, color=color))
plt.xticks(n_components_range)
plt.ylim([bic.min() * 1.01 - .01 * bic.max(), bic.max()])
plt.title('Wine - BIC score per model')
xpos = np.mod(bic.argmin(), len(n_components_range)) + .65 +\
    .2 * np.floor(bic.argmin() / len(n_components_range))
plt.text(xpos, bic.min() * 0.97 + .03 * bic.max(), '*', fontsize=14)
spl.set_xlabel('Number of components')
spl.legend([b[0] for b in bars], cv_types)
Out[31]:
<matplotlib.legend.Legend at 0x1f684d4aa08>

Observation: BIC score was calculated for 4 different covariance types. It was observed that BIC score decreases with the increase in number of components until component count reached 12.Lowest BIC score was observed for 'full' covariance type and it starts increasing after 14 component.

In [32]:
#Code for below plot was adaoted from sklearn website
#https://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_selection.html

lowest_bic = np.infty
bic = []
n_components_range = range(2, 15)
cv_types = ['spherical', 'tied', 'diag', 'full']
for cv_type in cv_types:
    for n_components in n_components_range:
        # Fit a Gaussian mixture with EM
        gmm = GaussianMixture(n_components=n_components,
                                      covariance_type=cv_type, reg_covar=0.001)
        gmm.fit(Census_X_Scaled)
        bic.append(gmm.bic(Census_X_Scaled))
        if bic[-1] < lowest_bic:
            lowest_bic = bic[-1]
            best_gmm = gmm

bic = np.array(bic)
color_iter = itertools.cycle(['navy', 'turquoise', 'cornflowerblue',
                              'darkorange'])
clf = best_gmm
bars = []

# Plot the BIC scores
plt.figure(figsize=(8, 6))
spl = plt.subplot(2, 1, 1)
for i, (cv_type, color) in enumerate(zip(cv_types, color_iter)):
    xpos = np.array(n_components_range) + .2 * (i - 2)
    bars.append(plt.bar(xpos, bic[i * len(n_components_range):
                                  (i + 1) * len(n_components_range)],
                        width=.2, color=color))
plt.xticks(n_components_range)
plt.ylim([bic.min() * 1.01 - .01 * bic.max(), bic.max()])
plt.title('Census - BIC score per model')
xpos = np.mod(bic.argmin(), len(n_components_range)) + .65 +\
    .2 * np.floor(bic.argmin() / len(n_components_range))
plt.text(xpos, bic.min() * 0.97 + .03 * bic.max(), '*', fontsize=14)
spl.set_xlabel('Number of components')
spl.legend([b[0] for b in bars], cv_types)
Out[32]:
<matplotlib.legend.Legend at 0x1f684e039c8>

Observation: For census dataset also 4 different covariance types were used to compare BIC score. It was observed that BIC score decreases with the increase in number of components. Lowest BIC score was observed for 'diag' covariance type with component count as 12.

5.3 External Validation & Model Comparison

Since both of the previous steps did not match the expected optimal cluster count, further analysis was carried out by leveraging the available labels for each datasets. Following evaluation metrics are used to conduct external measurement of clustering results -

  • Adjusted Rand Index: measures the similarity of the two assignments ignoring permutations. ARI ranges from -1 to 1, value of 0 indicates random assignment. Ideal value should be 1.0 whereas negative values indicates poor clustering.
  • Adjusted Mutual Index: measures the agreement of the two assignments ignoring permutations. AMI ranges from -1 to 1, value of 0 indicates random assignment. Ideal value should be 1.0 whereas negative values indicates poor clustering.
  • Homogeneity, completeness and V-measure: Homogeneity indicates whether each cluster contains only members of a single class and completeness indicates whether all members of a given class are assigned to the same cluster. V-measure is the harmonic mean of homogeneity and completeness.
    Execution time for both Kmeans and GMM algorithm was also compared to evaluate computational efficiency.
In [33]:
fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(10,10));

axs[0][0].plot(kclusters, dfstatsKM["adjmutinfo"], color='blue', linestyle='-', marker='o', label = "KM Adj Mutual Info")
axs[0][0].plot(kclusters, dfstatsEM["adjmutinfo"], color='red', linestyle='-',  marker='o', label = "EM Adj Mutual Info")
axs[0][0].set_title('Wine - Adj Mutual Info')

axs[0][1].plot(kclusters, dfstatsKM["adjrandinfo"], color='green', linestyle='-', marker='o', label = "KM Adj Random Info")
axs[0][1].plot(kclusters, dfstatsEM["adjrandinfo"], color='black', linestyle='-',  marker='o', label = "EM Adj Random Info")
axs[0][1].set_title('Wine - Adj Random Info')

axs[1][0].plot(kclusters, dfstatsKM["vmeasure"], color='black', linestyle='-', marker='o', label = "KM V Measure")
axs[1][0].plot(kclusters, dfstatsEM["vmeasure"], color='red', linestyle='-',  marker='o', label = "EM V Measure")
axs[1][0].set_title('Wine - V Measure')

axs[1][1].plot(kclusters, dfstatsKM["exectime"], color='green', linestyle='-', marker='o', label = "KM Execution Time")
axs[1][1].plot(kclusters, dfstatsEM["exectime"], color='blue', linestyle='-',  marker='o', label = "EM Execution Time")
axs[1][1].set_title('Wine - Execution Time')

for ax in axs.flat:
    ax.set(xlabel='# of Clusters', ylabel='Score')
    ax.legend(loc='best')
    ax.minorticks_on()
    ax.grid(b=True, which='major', color='k', linestyle='-', alpha=0.1)
    ax.grid(b=True, which='minor', color='r', linestyle='-', alpha=0.05)
fig.tight_layout()

Observation: For wine dataset, GMM clearly performs better as compared to K-means. On all the collected metrics, score was highest for EM when cluster count is 4. K-means assumes the clusters as spherical whereas GMM does not assume clusters to be of any geometric shape and hence works well with non-linear geometric distributions. Hence it outperformed K-means on the wine dataset. It is also observed that computational efficiency of K-means and EM is very close which can possibly attributed to small size of the dataset.

In [34]:
fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(10,10));

axs[0][0].plot(kclusters, dfstatsCensusKM["adjmutinfo"], color='blue', linestyle='-', marker='o', label = "KM Adj Mutual Info")
axs[0][0].plot(kclusters, dfstatsCensusEM["adjmutinfo"], color='red', linestyle='-',  marker='o', label = "EM Adj Mutual Info")
axs[0][0].set_title('Census - Adj Mutual Info')

axs[0][1].plot(kclusters, dfstatsCensusKM["adjrandinfo"], color='green', linestyle='-', marker='o', label = "KM Adj Random Info")
axs[0][1].plot(kclusters, dfstatsCensusEM["adjrandinfo"], color='black', linestyle='-',  marker='o', label = "EM Adj Random Info")
axs[0][1].set_title('Census - Adj Random Info')


axs[1][0].plot(kclusters, dfstatsCensusKM["vmeasure"], color='black', linestyle='-', marker='o', label = "KM V Measure")
axs[1][0].plot(kclusters, dfstatsCensusEM["vmeasure"], color='red', linestyle='-',  marker='o', label = "EM V Measure")
axs[1][0].set_title('Census - V Measure')

axs[1][1].plot(kclusters, dfstatsCensusKM["exectime"], color='green', linestyle='-', marker='o', label = "KM Execution Time")
axs[1][1].plot(kclusters, dfstatsCensusEM["exectime"], color='blue', linestyle='-',  marker='o', label = "EM Execution Time")
axs[1][1].set_title('Census - Execution Time')

for ax in axs.flat:
    ax.set(xlabel='# of Clusters', ylabel='Score')
    ax.legend(loc='best')
    ax.minorticks_on()
    ax.grid(b=True, which='major', color='k', linestyle='-', alpha=0.1)
    ax.grid(b=True, which='minor', color='r', linestyle='-', alpha=0.05)
fig.tight_layout()

Observation: For census dataset also GMM clearly performs better as compared to K-means. On all the collected metrics, score was highest for GMM when cluster count is 3. It can be inferred that model has some hidden parameters which as not easily observable and hence K-means performed poorly as compared to GMM.

6. Dimensionality Reduction

Dimension reduction is the process of reducing the number of random variables by identifying a set of principal variables. Four dimensionality reduction techniques are implemented to analyze their performance on the wine and census datasets. Objective is to identify optimal dimensions such that minimal information is lost at the same time efficiency of the algorithm is improved.

6.1 Principal component analysis (PCA)

PCA simplifies the complexity in high-dimensional data by identifying the correlation between variable. PCA attempt to reduce the dimensionality if a strong correlation exists between variables. PCA method available in sklearn.decomposition module is implemented to analyze the principle components. As a first step the data was standardized. Then Eigenvectors and Eigenvalues were obtained from the covariance matrix. Set of eigenvectors that correspond to the largest eigenvalues were selected by sorting eigenvectors. After that the projection matrix was constructed from the selected eigenvectors. This project matrix was used to reduce original dataset.

6.1.1 Wine Quality Prediction

In [35]:
pca = PCA()
pca.fit_transform(Wine_X_Scaled)
expvarratio = pca.explained_variance_ratio_
ncomponent = expvarratio.size
cumvariance = np.cumsum(expvarratio)
eigenvalues = pca.singular_values_
Wine_X_proj = pca.inverse_transform(Wine_X_Scaled)
loss = ((Wine_X_Scaled - Wine_X_proj) ** 2).mean()
In [36]:
fig = plt.figure(figsize=(10, 6))
ax1 = fig.add_subplot(111)
ax1.plot(list(range(cumvariance.size)), cumvariance, color='blue', linestyle='-', marker='o', label = "KM Adj Mutual Info")
ax1.set_ylabel('Cumulative Explained Variance Ratio')
plt.xlabel('Components')
plt.ylabel('Cumulative Explained Variance Ratio')
plt.minorticks_on()
plt.grid(b=True, which='major', color='k', linestyle='-', alpha=0.1)
plt.grid(b=True, which='minor', color='r', linestyle='-', alpha=0.05)

ax2 = ax1.twinx()
ax2.plot(list(range(eigenvalues.size)), eigenvalues, color='blue', linestyle='-', label = "Eigenvalues")
ax2.set_ylabel('Eigenvalues', color='r')
for tl in ax2.get_yticklabels():
    tl.set_color('r')

6.1.2 Income Prediction

In [37]:
pca = PCA()
pca.fit_transform(Census_X_Scaled)
expvarratio = pca.explained_variance_ratio_
ncomponent = expvarratio.size
cumvariance = np.cumsum(expvarratio)
eigenvalues = pca.singular_values_
Census_X_proj = pca.inverse_transform(Census_X_Scaled)
loss = ((Census_X_Scaled - Census_X_proj) ** 2).mean()
In [38]:
fig = plt.figure(figsize=(10, 6))
ax1 = fig.add_subplot(111)
ax1.plot(list(range(cumvariance.size)), cumvariance, color='blue', linestyle='-', marker='o', label = "KM Adj Mutual Info")
ax1.set_ylabel('Cumulative Explained Variance Ratio')
plt.xlabel('Components')
plt.ylabel('Cumulative Explained Variance Ratio')
plt.minorticks_on()
plt.grid(b=True, which='major', color='k', linestyle='-', alpha=0.1)
plt.grid(b=True, which='minor', color='r', linestyle='-', alpha=0.05)

ax2 = ax1.twinx()
ax2.plot(list(range(eigenvalues.size)), eigenvalues, color='blue', linestyle='-', label = "Eigenvalues")
ax2.set_ylabel('Eigenvalues', color='r')
for tl in ax2.get_yticklabels():
    tl.set_color('r')

Observation: A scree plot is used to check whether PCA works well on the wine dataset or not. As principal components capture the amount of variation, we can select principal components which capture most of the information and ignore the rest without losing any important information. Scree plot for wine dataset is not ideal (elbow is not prominent) and hence principal components from variance plot are selected which describes at least 80% of the variance. For wine dataset 4 components describe at least 80% of the variance. It can also be observed from scree plot that eigenvalues decrease sharply for first four components and flattens out till 9 components. For census dataset, first 9 components describe at least 80% of the variance.

In [39]:
pcaW = PCA(n_components = 4).fit(Wine_X_train)
Wine_X_train_PCA = pcaW.transform(Wine_X_train)
Wine_X_test_PCA = pcaW.transform(Wine_X_test)

pcaC = PCA(n_components = 9).fit(Census_X_train)
Census_X_train_PCA = pcaC.transform(Census_X_train)
Census_X_test_PCA = pcaC.transform(Census_X_test)

6.2 Independent component analysis (ICA)

ICA algorithm reveals underlying hidden factors and decompose linear mixtures of signals into their independent components. ICA algorithm reveals underlying hidden factors and decompose linear mixtures of signals into their independent components. ICA algorithm carries out the whitening of the data by removing correlations in the data. ICA creates independent components by decomposing multivariate signal. It achieves this by orthogonal rotation and maximize non-gaussian (kurtosis) between components. FastICA method available in sklearn.decomposition module is implemented to analyze the kurtosis.

ICA algorithm can be called as extension of PCA, but it performs better than PCA because it looks at each feature independently. Performance of ICA is impacted if all features contribute in building model.

6.2.1 Wine Quality Prediction

In [40]:
compnts = list(np.arange(1,(Wine_X_Scaled.shape[1]-1),1))

kurtosis = []
for c in compnts:
    ica = FastICA(n_components=c)
    Wineica = ica.fit_transform(Wine_X_Scaled)
    tmpdf = pd.DataFrame(Wineica)
    tmpdf = tmpdf.kurt(axis=0)
    kurtosis.append(tmpdf.abs().mean())
In [41]:
plt.figure(figsize=(6, 6))
plt.plot(compnts, kurtosis, color='blue', linestyle='-', marker='o')
plt.xlabel('Components')
plt.ylabel('Kurtosis')
plt.minorticks_on()
plt.grid(b=True, which='major', color='k', linestyle='-', alpha=0.1)
plt.grid(b=True, which='minor', color='r', linestyle='-', alpha=0.05)
plt.title("Wine - Kurtosis vs Components")
plt.show()

6.2.2 Income Prediction

In [42]:
compnts = list(np.arange(1,(Census_X_Scaled.shape[1]),1))
kurtosis = []
for c in compnts:
    ica = FastICA(n_components=c)
    Censusica = ica.fit_transform(Census_X_Scaled)
    tmpdf = pd.DataFrame(Censusica)
    tmpdf = tmpdf.kurt(axis=0)
    kurtosis.append(tmpdf.abs().mean())
In [43]:
plt.figure(figsize=(6, 6))
plt.plot(compnts, kurtosis, color='blue', linestyle='-', marker='o')
plt.xlabel('Components')
plt.ylabel('Kurtosis')
plt.minorticks_on()
plt.grid(b=True, which='major', color='k', linestyle='-', alpha=0.1)
plt.grid(b=True, which='minor', color='r', linestyle='-', alpha=0.05)
plt.title("Census - Kurtosis vs Components")
plt.show()

Observation: For wine dataset the number of components with highest kurtosis is 8 whereas for census dataset the the number of components with highest kurtosis is 7.

In [44]:
icaW = FastICA(n_components = 8).fit(Wine_X_train)
Wine_X_train_ICA = icaW.transform(Wine_X_train)
Wine_X_test_ICA = icaW.transform(Wine_X_test)

icaC = FastICA(n_components = 7).fit(Census_X_train)
Census_X_train_ICA = icaC.transform(Census_X_train)
Census_X_test_ICA = icaC.transform(Census_X_test)

6.3 Randomize Projections

Random Projection is another dimensionality reduction algorithm which projects the input space randomly to lower dimensional subspace. This algorithm is powerful, simple and it has low error rate as compared to other methods. GaussianRandomProjection method available in sklearn's random_projection module is used to analyze reconstruction error for both datasets. This algorithm reduce dimensionality through Gaussian random projection.

6.3.1 Wine Quality Prediction

In [45]:
compnts = list(np.arange(1,(Wine_X_Scaled.shape[1]),1))
err = []
for c in compnts:
    rp = GaussianRandomProjection(n_components=c)
    Winerp = rp.fit_transform(Wine_X_Scaled)
    recon = np.dot(Winerp, rp.components_)  
    reconerr = np.mean((Wine_X_Scaled - recon)**2)  
    err.append(reconerr)    
        
In [46]:
plt.figure(figsize=(6, 6))
plt.plot(compnts, err, color='blue', linestyle='-', marker='o')
plt.xlabel('Components')
plt.ylabel('Reconstruction Error')
plt.minorticks_on()
plt.grid(b=True, which='major', color='k', linestyle='-', alpha=0.1)
plt.grid(b=True, which='minor', color='r', linestyle='-', alpha=0.05)
plt.title("Wine - Reconstruction Error vs Components")
plt.show()

6.3.2 Income Prediction

In [47]:
compnts = list(np.arange(1,(Census_X_Scaled.shape[1]),1))
err = []
for c in compnts:
    rp = GaussianRandomProjection(n_components=c)
    Censusrp = rp.fit_transform(Census_X_Scaled)
    recon = np.dot(Censusrp, rp.components_)  
    reconerr = np.mean((Census_X_Scaled - recon)**2) 
    err.append(reconerr) 
In [48]:
plt.figure(figsize=(6, 6))
plt.plot(compnts, err, color='blue', linestyle='-', marker='o')
plt.xlabel('Components')
plt.ylabel('Reconstruction Error')
plt.minorticks_on()
plt.grid(b=True, which='major', color='k', linestyle='-', alpha=0.1)
plt.grid(b=True, which='minor', color='r', linestyle='-', alpha=0.05)
plt.title("Census - Reconstruction Error vs Components")
plt.show()

Observation: Analysis of Reconstruction Error for wine and census dataset reveals that the optimal number of components for wine dataset is 4 whereas optimal component for census dataset is 5.

In [49]:
rpW = GaussianRandomProjection(n_components = 2).fit(Wine_X_train)
Wine_X_train_RP = rpW.transform(Wine_X_train)
Wine_X_test_RP = rpW.transform(Wine_X_test)

rpC = GaussianRandomProjection(n_components = 7).fit(Census_X_train)
Census_X_train_RP = rpC.transform(Census_X_train)
Census_X_test_RP = rpC.transform(Census_X_test)

6.4 Random Forest Classifier

Besides classification, Random Forest Classifier can also we used to reduce dimensionality. Random Forest Classifier can analyze each feature and assess its importance by finding out how much each feature contribute to the model’s information. This algorithm does not assume that data is linearly separable. Random Forest count how many times each feature has been selected for a split and generates a score. This score helps in identifying most predictive features.
Sklearn's RandomForestClassifier is used to assess the importance of features for both datasets.

6.4.1 Wine Quality Prediction

In [50]:
#Below code is adapted from - 
#https://github.com/Einsteinish/bogotobogo-Machine-Learning/blob/master/scikit_machine_learning_Data_Processing-III-Sequential-Feature-Selection.ipynb

feat_labels = df_combined.columns[1:]
forest = RandomForestClassifier(max_depth=10, ccp_alpha=0.007, criterion='entropy')
forest.fit(Wine_X_Scaled, Wine_y)
importances = forest.feature_importances_
indices = np.argsort(importances)[::-1]
for f in range(Wine_X_Scaled.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30, feat_labels[f],importances[indices[f]]))
 1) volatile acidity               0.443458
 2) citric acid                    0.194500
 3) residual sugar                 0.147542
 4) chlorides                      0.073801
 5) free sulfur dioxide            0.044492
 6) total sulfur dioxide           0.034633
 7) density                        0.023535
 8) pH                             0.018180
 9) sulphates                      0.013862
10) alcohol                        0.003387
11) quality                        0.002611
In [51]:
plt.title('Wine - Feature Importances')
plt.bar(range(Wine_X_Scaled.shape[1]), importances[indices], 
                     color='green', align='center')
plt.xticks(range(Wine_X_Scaled.shape[1]), feat_labels, rotation=90)
plt.xlim([-1, Wine_X_Scaled.shape[1]])
plt.tight_layout()
plt.show()
In [52]:
sfm = SelectFromModel(forest, threshold=0.05)
sfm.fit(Wine_X_train, Wine_y_train)
for feature_list_index in sfm.get_support(indices=True):
    print(feat_labels[feature_list_index])
Wine_X_train_RF = sfm.transform(Wine_X_train)
Wine_X_test_RF = sfm.transform(Wine_X_test)
citric acid
free sulfur dioxide
pH
quality

6.4.2 Income Prediction

In [53]:
#Below code is adapted from - 
#https://github.com/Einsteinish/bogotobogo-Machine-Learning/blob/master/scikit_machine_learning_Data_Processing-III-Sequential-Feature-Selection.ipynb

feat_labels = df_census.columns[1:]
forest = RandomForestClassifier(max_depth=10, ccp_alpha=0.002, criterion='entropy')
forest.fit(Census_X_Scaled, Census_y)
importances = forest.feature_importances_
indices = np.argsort(importances)[::-1]
for f in range(Census_X_Scaled.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30, feat_labels[f],importances[indices[f]]))
 1) workclass                      0.248275
 2) fnlwgt                         0.210425
 3) education                      0.171631
 4) education-num                  0.129619
 5) marital-status                 0.087531
 6) occupation                     0.036827
 7) relationship                   0.034503
 8) race                           0.031892
 9) sex                            0.023541
10) capital-gain                   0.022102
11) capital-loss                   0.002022
12) hours-per-week                 0.000758
13) native-country                 0.000444
14) income                         0.000430
In [54]:
plt.title('Census - Feature Importances')
plt.bar(range(Census_X_Scaled.shape[1]), importances[indices], 
                     color='green', align='center')
plt.xticks(range(Census_X_Scaled.shape[1]), feat_labels, rotation=90)
plt.xlim([-1, Census_X_Scaled.shape[1]])
plt.tight_layout()
plt.show()
In [55]:
sfm = SelectFromModel(forest, threshold=0.05)
sfm.fit(Census_X_train, Census_y_train)
for feature_list_index in sfm.get_support(indices=True):
    print(feat_labels[feature_list_index])
Census_X_train_RF = sfm.transform(Census_X_train)
Census_X_test_RF = sfm.transform(Census_X_test)
workclass
marital-status
occupation
race
capital-loss

Observation: For wine dataset three features with highest importance are citric acid. free sulfur dioxide, pH and quality. For census dataset, following 8 features with highest importance will be used - workclass, marital-status, occupation, race and capital-loss.

7. Clustering on the Dimensionally Reduced Datasets

In this step two clustering algorithms implemented in section 5, will be reproduced your clustering experiments, but on the datasets on which dimensionality reduction algorithms were implemented in section 6.

7.1 Kmeans - PCA

7.1.1 Wine Quality Prediction

In [56]:
model_stats = []
kclusters = range(2, 15)
for k in kclusters:
    model = KMeans(n_clusters=k).fit(Wine_X_train_PCA)
    t_start = time.time()*1000.0
    predcls = model.predict(Wine_X_train_PCA)
    t_stop = time.time()*1000.0
    exectime = t_stop - t_start
    modelid = "Wine_PCA_Kmeans"
    centroids = model.cluster_centers_
    sse = model.inertia_
    adjmutinfo = adjusted_mutual_info_score(Wine_y_train,predcls)
    adjrandinfo = adjusted_rand_score(Wine_y_train,predcls)
    homogene, complete, vmeasure = homogeneity_completeness_v_measure(Wine_y_train,predcls)

    model_stats.append([modelid, k, predcls, sse, centroids, adjmutinfo, adjrandinfo, homogene, complete, vmeasure, exectime])

7.1.2 Income Prediction

In [57]:
kclusters = range(2, 15)
for k in kclusters:
    model = KMeans(n_clusters=k).fit(Census_X_train_PCA)
    t_start = time.time()*1000.0
    model.fit(Census_X_train_PCA)
    predcls = model.predict(Census_X_train_PCA)
    t_stop = time.time()*1000.0
    exectime = t_stop - t_start
    modelid = "Census_PCA_Kmeans"
    centroids = model.cluster_centers_
    sse = model.inertia_
    adjmutinfo = adjusted_mutual_info_score(Census_y_train,predcls)
    adjrandinfo = adjusted_rand_score(Census_y_train,predcls)
    homogene, complete, vmeasure = homogeneity_completeness_v_measure(Census_y_train,predcls)

    model_stats.append([modelid, k, predcls, sse, centroids, adjmutinfo, adjrandinfo, homogene, complete, vmeasure, exectime])

7.2 Kmeans - ICA

7.2.1 Wine Quality Prediction

In [58]:
kclusters = range(2, 15)
for k in kclusters:
    model = KMeans(n_clusters=k).fit(Wine_X_train_ICA)
    t_start = time.time()*1000.0
    predcls = model.predict(Wine_X_train_ICA)
    t_stop = time.time()*1000.0
    exectime = t_stop - t_start
    modelid = "Wine_ICA_Kmeans"
    centroids = model.cluster_centers_
    sse = model.inertia_
    adjmutinfo = adjusted_mutual_info_score(Wine_y_train,predcls)
    adjrandinfo = adjusted_rand_score(Wine_y_train,predcls)
    homogene, complete, vmeasure = homogeneity_completeness_v_measure(Wine_y_train,predcls)

    model_stats.append([modelid, k, predcls, sse, centroids, adjmutinfo, adjrandinfo, homogene, complete, vmeasure, exectime])

7.2.2 Income Prediction

In [59]:
kclusters = range(2, 15)
for k in kclusters:
    model = KMeans(n_clusters=k).fit(Census_X_train_ICA)
    t_start = time.time()*1000.0
    model.fit(Census_X_train_ICA)
    predcls = model.predict(Census_X_train_ICA)
    t_stop = time.time()*1000.0
    exectime = t_stop - t_start
    modelid = "Census_ICA_Kmeans"
    centroids = model.cluster_centers_
    sse = model.inertia_
    adjmutinfo = adjusted_mutual_info_score(Census_y_train,predcls)
    adjrandinfo = adjusted_rand_score(Census_y_train,predcls)
    homogene, complete, vmeasure = homogeneity_completeness_v_measure(Census_y_train,predcls)

    model_stats.append([modelid, k, predcls, sse, centroids, adjmutinfo, adjrandinfo, homogene, complete, vmeasure, exectime])

7.3 Kmeans - RP

7.3.1 Wine Quality Prediction

In [60]:
kclusters = range(2, 15)
for k in kclusters:
    model = KMeans(n_clusters=k).fit(Wine_X_train_RP)
    t_start = time.time()*1000.0
    predcls = model.predict(Wine_X_train_RP)
    t_stop = time.time()*1000.0
    exectime = t_stop - t_start
    modelid = "Wine_RP_Kmeans"
    centroids = model.cluster_centers_
    sse = model.inertia_
    adjmutinfo = adjusted_mutual_info_score(Wine_y_train,predcls)
    adjrandinfo = adjusted_rand_score(Wine_y_train,predcls)
    homogene, complete, vmeasure = homogeneity_completeness_v_measure(Wine_y_train,predcls)

    model_stats.append([modelid, k, predcls, sse, centroids, adjmutinfo, adjrandinfo, homogene, complete, vmeasure, exectime])

7.3.2 Income Prediction

In [61]:
kclusters = range(2, 15)
for k in kclusters:
    model = KMeans(n_clusters=k).fit(Census_X_train_RP)
    t_start = time.time()*1000.0
    model.fit(Census_X_train_RP)
    predcls = model.predict(Census_X_train_RP)
    t_stop = time.time()*1000.0
    exectime = t_stop - t_start
    modelid = "Census_RP_Kmeans"
    centroids = model.cluster_centers_
    sse = model.inertia_
    adjmutinfo = adjusted_mutual_info_score(Census_y_train,predcls)
    adjrandinfo = adjusted_rand_score(Census_y_train,predcls)
    homogene, complete, vmeasure = homogeneity_completeness_v_measure(Census_y_train,predcls)

    model_stats.append([modelid, k, predcls, sse, centroids, adjmutinfo, adjrandinfo, homogene, complete, vmeasure, exectime])

7.4 Kmeans - Random Forest

7.4.1 Wine Quality Prediction

In [62]:
kclusters = range(2, 15)
for k in kclusters:
    model = KMeans(n_clusters=k).fit(Wine_X_train_RF)
    t_start = time.time()*1000.0
    predcls = model.predict(Wine_X_train_RF)
    t_stop = time.time()*1000.0
    exectime = t_stop - t_start
    modelid = "Wine_RF_Kmeans"
    centroids = model.cluster_centers_
    sse = model.inertia_
    adjmutinfo = adjusted_mutual_info_score(Wine_y_train,predcls)
    adjrandinfo = adjusted_rand_score(Wine_y_train,predcls)
    homogene, complete, vmeasure = homogeneity_completeness_v_measure(Wine_y_train,predcls)

    model_stats.append([modelid, k, predcls, sse, centroids, adjmutinfo, adjrandinfo, homogene, complete, vmeasure, exectime])

7.4.2 Income Prediction

In [63]:
kclusters = range(2, 15)
for k in kclusters:
    model = KMeans(n_clusters=k).fit(Census_X_train_RF)
    t_start = time.time()*1000.0
    model.fit(Census_X_train_RF)
    predcls = model.predict(Census_X_train_RF)
    t_stop = time.time()*1000.0
    exectime = t_stop - t_start
    modelid = "Census_RF_Kmeans"
    centroids = model.cluster_centers_
    sse = model.inertia_
    adjmutinfo = adjusted_mutual_info_score(Census_y_train,predcls)
    adjrandinfo = adjusted_rand_score(Census_y_train,predcls)
    homogene, complete, vmeasure = homogeneity_completeness_v_measure(Census_y_train,predcls)

    model_stats.append([modelid, k, predcls, sse, centroids, adjmutinfo, adjrandinfo, homogene, complete, vmeasure, exectime])
In [64]:
dfstatsKMDimRed = pd.DataFrame(model_stats, columns=['model_id', 'k', 'predcls', 'sse', 'centroids', 'adjmutinfo', 'adjrandinfo', 'homogene', 'complete', 'vmeasure', 'exectime'])
dfstatsKMDimRed
Out[64]:
model_id k predcls sse centroids adjmutinfo adjrandinfo homogene complete vmeasure exectime
0 Wine_PCA_Kmeans 2 [1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, ... 25851.766200 [[0.8602931451052558, -0.27547710705429407, 0.... 0.015263 0.044756 0.014424 0.016610 0.015440 0.997314
1 Wine_PCA_Kmeans 3 [2, 2, 0, 1, 2, 1, 0, 1, 0, 0, 2, 1, 1, 2, 1, ... 18546.156424 [[1.6742730275464648, 1.2718018223142724, -0.3... 0.034378 0.061124 0.045373 0.027994 0.034625 0.997314
2 Wine_PCA_Kmeans 4 [2, 2, 1, 0, 3, 0, 1, 0, 1, 1, 2, 0, 0, 3, 0, ... 15230.764838 [[0.3014907703588699, -1.2985093359258801, 0.3... 0.039429 0.070876 0.057586 0.030364 0.039762 0.000000
3 Wine_PCA_Kmeans 5 [0, 0, 1, 2, 3, 2, 1, 4, 1, 4, 0, 4, 4, 3, 2, ... 13708.125829 [[-2.4584123682476418, 0.19471332868583138, -1... 0.039481 0.034913 0.067128 0.028349 0.039863 1.995117
4 Wine_PCA_Kmeans 6 [3, 3, 4, 1, 2, 1, 4, 5, 4, 0, 3, 5, 5, 2, 0, ... 12582.595793 [[0.9355260670483599, 0.1852837138187771, -0.4... 0.036536 0.028152 0.068250 0.025354 0.036973 0.997803
... ... ... ... ... ... ... ... ... ... ... ...
99 Census_RF_Kmeans 10 [9, 6, 2, 5, 1, 7, 2, 1, 6, 6, 7, 6, 1, 4, 9, ... 25530.396844 [[-1.0733593126999927, -1.6759077869237005, 0.... 0.097087 0.031470 0.232003 0.061503 0.097231 958.498535
100 Census_RF_Kmeans 11 [10, 5, 0, 1, 4, 8, 0, 4, 5, 5, 3, 5, 4, 7, 10... 23314.486233 [[0.15877777612656063, 0.12331508415229486, -1... 0.103717 0.033939 0.260297 0.064879 0.103869 1078.130615
101 Census_RF_Kmeans 12 [3, 1, 0, 7, 2, 8, 6, 2, 1, 1, 11, 1, 2, 10, 3... 22415.631298 [[0.1683378245450711, 0.2746523214394192, -0.2... 0.108501 0.037232 0.277260 0.067573 0.108664 1024.260986
102 Census_RF_Kmeans 13 [0, 7, 10, 9, 3, 2, 10, 3, 2, 7, 1, 7, 3, 11, ... 21032.331032 [[-1.102992590910088, -0.49773430483896297, 0.... 0.095589 0.034266 0.254651 0.058969 0.095762 1197.739258
103 Census_RF_Kmeans 14 [2, 3, 5, 13, 6, 8, 12, 6, 8, 3, 9, 3, 6, 11, ... 19573.465280 [[-0.8176433312298983, -2.0271847154334157, 0.... 0.098373 0.031429 0.267150 0.060424 0.098556 1475.869141

104 rows × 11 columns

7.5 EM - PCA

7.5.1 Wine Quality Prediction

In [65]:
model_stats = []
kclusters = range(2, 15)
for k in kclusters:
    model = GaussianMixture(n_components=k).fit(Wine_X_train_PCA)
    t_start = time.time()*1000.0
    predcls = model.predict(Wine_X_train_PCA)
    t_stop = time.time()*1000.0
    exectime = t_stop - t_start
    modelid = "Wine_PCA_EM"
    centroids = model.means_
    loglike = model.score(Wine_X_train_PCA)
    silscores = silhouette_samples(Wine_X_train_PCA, predcls)
    adjmutinfo = adjusted_mutual_info_score(Wine_y_train,predcls)
    adjrandinfo = adjusted_rand_score(Wine_y_train,predcls)
    homogene, complete, vmeasure = homogeneity_completeness_v_measure(Wine_y_train,predcls)
    aic = model.aic(Wine_X_train_PCA)
    bic = model.bic(Wine_X_train_PCA)
    model_stats.append([modelid, k, predcls, loglike, silscores, centroids, adjmutinfo, adjrandinfo, homogene, complete, vmeasure, aic, bic, exectime])

7.5.2 Income Prediction

In [66]:
kclusters = range(2, 15)
for k in kclusters:
    model = GaussianMixture(n_components=k, covariance_type='diag').fit(Census_X_train_PCA)
    t_start = time.time()*1000.0
    model.fit(Census_X_train_PCA)
    predcls = model.predict(Census_X_train_PCA)
    t_stop = time.time()*1000.0
    exectime = t_stop - t_start
    modelid = "Census_PCA_EM"
    centroids = model.means_
    loglike = model.score(Census_X_train_PCA)
    silscores = silhouette_samples(Census_X_train_PCA, predcls)
    adjmutinfo = adjusted_mutual_info_score(Census_y_train,predcls)
    adjrandinfo = adjusted_rand_score(Census_y_train,predcls)
    homogene, complete, vmeasure = homogeneity_completeness_v_measure(Census_y_train,predcls)
    aic = model.aic(Census_X_train_PCA)
    bic = model.bic(Census_X_train_PCA)
    model_stats.append([modelid, k, predcls, loglike, silscores, centroids, adjmutinfo, adjrandinfo, homogene, complete, vmeasure, aic, bic, exectime])

7.6 EM - ICA

7.6.1 Wine Quality Prediction

In [67]:
kclusters = range(2, 15)
for k in kclusters:
    model = GaussianMixture(n_components=k).fit(Wine_X_train_ICA)
    t_start = time.time()*1000.0
    predcls = model.predict(Wine_X_train_ICA)
    t_stop = time.time()*1000.0
    exectime = t_stop - t_start
    modelid = "Wine_ICA_EM"
    centroids = model.means_
    loglike = model.score(Wine_X_train_ICA)
    silscores = silhouette_samples(Wine_X_train_ICA, predcls)
    adjmutinfo = adjusted_mutual_info_score(Wine_y_train,predcls)
    adjrandinfo = adjusted_rand_score(Wine_y_train,predcls)
    homogene, complete, vmeasure = homogeneity_completeness_v_measure(Wine_y_train,predcls)
    aic = model.aic(Wine_X_train_ICA)
    bic = model.bic(Wine_X_train_ICA)
    model_stats.append([modelid, k, predcls, loglike, silscores, centroids, adjmutinfo, adjrandinfo, homogene, complete, vmeasure, aic, bic, exectime])

7.6.2 Income Prediction

In [68]:
kclusters = range(2, 15)
for k in kclusters:
    model = GaussianMixture(n_components=k, covariance_type='diag').fit(Census_X_train_ICA)
    t_start = time.time()*1000.0
    model.fit(Census_X_train_ICA)
    predcls = model.predict(Census_X_train_ICA)
    t_stop = time.time()*1000.0
    exectime = t_stop - t_start
    modelid = "Census_ICA_EM"
    centroids = model.means_
    loglike = model.score(Census_X_train_ICA)
    silscores = silhouette_samples(Census_X_train_ICA, predcls)
    adjmutinfo = adjusted_mutual_info_score(Census_y_train,predcls)
    adjrandinfo = adjusted_rand_score(Census_y_train,predcls)
    homogene, complete, vmeasure = homogeneity_completeness_v_measure(Census_y_train,predcls)
    aic = model.aic(Census_X_train_ICA)
    bic = model.bic(Census_X_train_ICA)
    model_stats.append([modelid, k, predcls, loglike, silscores, centroids, adjmutinfo, adjrandinfo, homogene, complete, vmeasure, aic, bic, exectime])

7.7 EM - RP

7.7.1 Wine Quality Prediction

In [69]:
kclusters = range(2, 15)
for k in kclusters:
    model = GaussianMixture(n_components=k).fit(Wine_X_train_RP)
    t_start = time.time()*1000.0
    predcls = model.predict(Wine_X_train_RP)
    t_stop = time.time()*1000.0
    exectime = t_stop - t_start
    modelid = "Wine_RP_EM"
    centroids = model.means_
    loglike = model.score(Wine_X_train_RP)
    silscores = silhouette_samples(Wine_X_train_RP, predcls)
    adjmutinfo = adjusted_mutual_info_score(Wine_y_train,predcls)
    adjrandinfo = adjusted_rand_score(Wine_y_train,predcls)
    homogene, complete, vmeasure = homogeneity_completeness_v_measure(Wine_y_train,predcls)
    aic = model.aic(Wine_X_train_RP)
    bic = model.bic(Wine_X_train_RP)
    model_stats.append([modelid, k, predcls, loglike, silscores, centroids, adjmutinfo, adjrandinfo, homogene, complete, vmeasure, aic, bic, exectime])

7.7.2 Income Prediction

In [70]:
kclusters = range(2, 15)
for k in kclusters:
    model = GaussianMixture(n_components=k, covariance_type='diag').fit(Census_X_train_RP)
    t_start = time.time()*1000.0
    model.fit(Census_X_train_RP)
    predcls = model.predict(Census_X_train_RP)
    t_stop = time.time()*1000.0
    exectime = t_stop - t_start
    modelid = "Census_RP_EM"
    centroids = model.means_
    loglike = model.score(Census_X_train_RP)
    silscores = silhouette_samples(Census_X_train_RP, predcls)
    adjmutinfo = adjusted_mutual_info_score(Census_y_train,predcls)
    adjrandinfo = adjusted_rand_score(Census_y_train,predcls)
    homogene, complete, vmeasure = homogeneity_completeness_v_measure(Census_y_train,predcls)
    aic = model.aic(Census_X_train_RP)
    bic = model.bic(Census_X_train_RP)
    model_stats.append([modelid, k, predcls, loglike, silscores, centroids, adjmutinfo, adjrandinfo, homogene, complete, vmeasure, aic, bic, exectime])

7.8 EM - Random Forest

7.8.1 Wine Quality Prediction

In [71]:
kclusters = range(2, 15)
for k in kclusters:
    model = GaussianMixture(n_components=k).fit(Wine_X_train_RF)
    t_start = time.time()*1000.0
    predcls = model.predict(Wine_X_train_RF)
    t_stop = time.time()*1000.0
    exectime = t_stop - t_start
    modelid = "Wine_RF_EM"
    centroids = model.means_
    loglike = model.score(Wine_X_train_RF)
    silscores = silhouette_samples(Wine_X_train_RF, predcls)
    adjmutinfo = adjusted_mutual_info_score(Wine_y_train,predcls)
    adjrandinfo = adjusted_rand_score(Wine_y_train,predcls)
    homogene, complete, vmeasure = homogeneity_completeness_v_measure(Wine_y_train,predcls)
    aic = model.aic(Wine_X_train_RF)
    bic = model.bic(Wine_X_train_RF)
    model_stats.append([modelid, k, predcls, loglike, silscores, centroids, adjmutinfo, adjrandinfo, homogene, complete, vmeasure, aic, bic, exectime])

7.8.2 Income Prediction

In [72]:
kclusters = range(2, 15)
for k in kclusters:
    model = GaussianMixture(n_components=k, covariance_type='diag').fit(Census_X_train_RF)
    t_start = time.time()*1000.0
    model.fit(Census_X_train_RF)
    predcls = model.predict(Census_X_train_RF)
    t_stop = time.time()*1000.0
    exectime = t_stop - t_start
    modelid = "Census_RF_EM"
    centroids = model.means_
    loglike = model.score(Census_X_train_RF)
    silscores = silhouette_samples(Census_X_train_RF, predcls)
    adjmutinfo = adjusted_mutual_info_score(Census_y_train,predcls)
    adjrandinfo = adjusted_rand_score(Census_y_train,predcls)
    homogene, complete, vmeasure = homogeneity_completeness_v_measure(Census_y_train,predcls)
    aic = model.aic(Census_X_train_RF)
    bic = model.bic(Census_X_train_RF)
    model_stats.append([modelid, k, predcls, loglike, silscores, centroids, adjmutinfo, adjrandinfo, homogene, complete, vmeasure, aic, bic, exectime])
In [73]:
dfstatsEMDimRed = pd.DataFrame(model_stats, columns=['model_id', 'k', 'predcls', 'loglike', 'silscore', 'centroids', 'adjmutinfo', 'adjrandinfo', 'homogene', 'complete', 'vmeasure', 'aic', 'bic', 'exectime'])
dfstatsEMDimRed
Out[73]:
model_id k predcls loglike silscore centroids adjmutinfo adjrandinfo homogene complete vmeasure aic bic exectime
0 Wine_PCA_EM 2 [1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, ... -6.382168 [0.2994411177811154, 0.28201345232883346, 0.43... [[0.9526262166988266, -0.23717137331231622, 0.... 0.016664 0.044907 0.016278 0.017431 0.016835 58097.435550 58283.680016 0.997803
1 Wine_PCA_EM 3 [1, 1, 2, 0, 1, 0, 2, 0, 2, 2, 1, 0, 0, 1, 0, ... -6.319197 [0.30587437096730385, 0.25387928993969294, 0.5... [[0.3435557030274989, -1.249321885272519, 0.42... 0.037691 0.060146 0.050109 0.030521 0.037935 57554.777237 57837.355047 0.998047
2 Wine_PCA_EM 4 [3, 3, 0, 2, 1, 2, 0, 2, 0, 0, 3, 2, 2, 1, 2, ... -6.228540 [0.5641319179569836, 0.4817227312026789, 0.499... [[1.6158967306773642, 1.0947489641948014, -0.4... 0.042597 0.070215 0.062438 0.032708 0.042928 56760.340774 57139.251928 1.029785
3 Wine_PCA_EM 5 [2, 2, 3, 1, 0, 1, 3, 1, 3, 3, 0, 1, 1, 0, 1, ... -6.156342 [0.5322900776599415, 0.3505793062468339, 0.501... [[-2.1140091964464394, 1.0159624546152024, 0.9... 0.038601 0.070779 0.059317 0.029077 0.039025 56133.770133 56609.014632 1.995117
4 Wine_PCA_EM 6 [5, 5, 2, 1, 3, 1, 2, 4, 2, 4, 5, 4, 4, 3, 1, ... -6.135000 [0.516202514581449, 0.40903979391002204, 0.489... [[-0.50841539661482, 2.016854824475285, 0.5633... 0.052313 0.056398 0.091893 0.037011 0.052769 55969.691816 56541.269659 0.997070
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
99 Census_RF_EM 10 [9, 2, 7, 8, 2, 6, 7, 2, 2, 2, 5, 2, 2, 2, 9, ... 8.762507 [0.5042593399988161, 0.3139854099144985, 0.154... [[0.5348577521424768, 0.31153617659115096, -0.... 0.104344 0.010593 0.222236 0.068314 0.104504 -369787.603559 -368920.220336 1072.151611
100 Census_RF_EM 11 [9, 2, 3, 1, 2, 0, 3, 2, 2, 2, 8, 2, 2, 2, 9, ... 9.066826 [0.5026306369278313, 0.18925568623879807, 0.15... [[0.46667489208941354, 0.14062435719881491, -0... 0.105549 0.008401 0.230694 0.068575 0.105723 -382615.786584 -381660.869275 1076.219238
101 Census_RF_EM 12 [1, 9, 2, 5, 9, 3, 8, 9, 9, 9, 6, 9, 9, 9, 1, ... 9.872868 [0.5070920323555642, 0.2930065778747835, 0.294... [[0.4799012470084904, 1.128918381851001, -0.38... 0.109519 0.008255 0.248286 0.070405 0.109703 -416629.740267 -415587.288871 889.070312
102 Census_RF_EM 13 [10, 5, 11, 4, 5, 12, 7, 5, 5, 5, 0, 5, 5, 5, ... 9.276156 [0.5207379418342953, 0.17435938709011584, 0.29... [[0.42529179332506023, -0.02991864105873517, -... 0.111762 0.011391 0.253048 0.071883 0.111962 -391410.957999 -390280.972516 1368.344727
103 Census_RF_EM 14 [9, 7, 5, 10, 4, 2, 11, 7, 7, 7, 8, 7, 7, 7, 9... 10.071025 [0.5105094884129544, 0.12524757688363344, 0.28... [[0.8663986580886773, -0.2592991499196844, 1.6... 0.107562 0.027863 0.262664 0.067787 0.107763 -424953.090463 -423735.570894 1691.018799

104 rows × 14 columns

In [76]:
fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(15,15));

axs[0][0].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Wine_PCA_Kmeans']["adjmutinfo"], linestyle='-', marker='o', label = "Wine PCA KM")
axs[0][0].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Wine_PCA_EM']["adjmutinfo"], linestyle='-',  marker='o', label = "Wine PCA EM")
axs[0][0].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Census_PCA_Kmeans']["adjmutinfo"], linestyle='-', marker='o', label = "Census PCA KM")
axs[0][0].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Census_PCA_EM']["adjmutinfo"], linestyle='-',  marker='o', label = "Census PCA EM")
axs[0][0].set_title('Reduced Dimensionality: PCA - Adj Mutual Info')

axs[0][1].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Wine_ICA_Kmeans']["adjmutinfo"], linestyle='-', marker='o', label = "Wine ICA KM")
axs[0][1].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Wine_ICA_EM']["adjmutinfo"], linestyle='-',  marker='o', label = "Wine ICA EM")
axs[0][1].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Census_ICA_Kmeans']["adjmutinfo"], linestyle='-', marker='o', label = "Census ICA KM")
axs[0][1].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Census_ICA_EM']["adjmutinfo"], linestyle='-',  marker='o', label = "Census ICA EM")
axs[0][1].set_title('Reduced Dimensionality: ICA - Adj Mutual Info')

axs[1][0].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Wine_RP_Kmeans']["adjmutinfo"], linestyle='-', marker='o', label = "Wine RP KM")
axs[1][0].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Wine_RP_EM']["adjmutinfo"], linestyle='-',  marker='o', label = "Wine RP EM")
axs[1][0].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Census_RP_Kmeans']["adjmutinfo"], linestyle='-', marker='o', label = "Census RP KM")
axs[1][0].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Census_RP_EM']["adjmutinfo"], linestyle='-',  marker='o', label = "Census RP EM")
axs[1][0].set_title('Reduced Dimensionality: RP - Adj Mutual Info')

axs[1][1].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Wine_RF_Kmeans']["adjmutinfo"], linestyle='-', marker='o', label = "Wine RF KM")
axs[1][1].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Wine_RF_EM']["adjmutinfo"], linestyle='-',  marker='o', label = "Wine RF EM")
axs[1][1].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Census_RF_Kmeans']["adjmutinfo"], linestyle='-', marker='o', label = "Census RF KM")
axs[1][1].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Census_RF_EM']["adjmutinfo"], linestyle='-',  marker='o', label = "Census RF EM")
axs[1][1].set_title('Reduced Dimensionality: RF - Adj Mutual Info')

for ax in axs.flat:
    ax.set(xlabel='# of Clusters', ylabel='Score')
    ax.legend(loc='best')
    ax.minorticks_on()
    ax.grid(b=True, which='major', color='k', linestyle='-', alpha=0.1)
    ax.grid(b=True, which='minor', color='r', linestyle='-', alpha=0.05)
fig.tight_layout()
In [77]:
fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(15,15));

axs[0][0].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Wine_PCA_Kmeans']["adjrandinfo"], linestyle='-', marker='o', label = "Wine PCA KM")
axs[0][0].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Wine_PCA_EM']["adjrandinfo"], linestyle='-',  marker='o', label = "Wine PCA EM")
axs[0][0].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Census_PCA_Kmeans']["adjrandinfo"], linestyle='-', marker='o', label = "Census PCA KM")
axs[0][0].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Census_PCA_EM']["adjrandinfo"], linestyle='-',  marker='o', label = "Census PCA EM")
axs[0][0].set_title('Reduced Dimensionality: PCA - Adj Random Info')

axs[0][1].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Wine_ICA_Kmeans']["adjrandinfo"], linestyle='-', marker='o', label = "Wine ICA KM")
axs[0][1].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Wine_ICA_EM']["adjrandinfo"], linestyle='-',  marker='o', label = "Wine ICA EM")
axs[0][1].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Census_ICA_Kmeans']["adjrandinfo"], linestyle='-', marker='o', label = "Census ICA KM")
axs[0][1].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Census_ICA_EM']["adjrandinfo"], linestyle='-',  marker='o', label = "Census ICA EM")
axs[0][1].set_title('Reduced Dimensionality: ICA - Adj Random Info')

axs[1][0].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Wine_RP_Kmeans']["adjrandinfo"], linestyle='-', marker='o', label = "Wine RP KM")
axs[1][0].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Wine_RP_EM']["adjrandinfo"], linestyle='-',  marker='o', label = "Wine RP EM")
axs[1][0].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Census_RP_Kmeans']["adjrandinfo"], linestyle='-', marker='o', label = "Census RP KM")
axs[1][0].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Census_RP_EM']["adjrandinfo"], linestyle='-',  marker='o', label = "Census RP EM")
axs[1][0].set_title('Reduced Dimensionality: RP - Adj Random Info')

axs[1][1].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Wine_RF_Kmeans']["adjrandinfo"], linestyle='-', marker='o', label = "Wine RF KM")
axs[1][1].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Wine_RF_EM']["adjrandinfo"], linestyle='-',  marker='o', label = "Wine RF EM")
axs[1][1].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Census_RF_Kmeans']["adjrandinfo"], linestyle='-', marker='o', label = "Census RF KM")
axs[1][1].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Census_RF_EM']["adjrandinfo"], linestyle='-',  marker='o', label = "Census RF EM")
axs[1][1].set_title('Reduced Dimensionality: RF - Adj Random Info')


for ax in axs.flat:
    ax.set(xlabel='# of Clusters', ylabel='Score')
    ax.legend(loc='best')
    ax.minorticks_on()
    ax.grid(b=True, which='major', color='k', linestyle='-', alpha=0.1)
    ax.grid(b=True, which='minor', color='r', linestyle='-', alpha=0.05)
fig.tight_layout()

Oberservation: For wine dataset highest ARI score of 0.072 was for PCA at cluster size of 5 whereas ARI score of 0.155 was the highest for Census dataset for 4 FastICA clusters.

In [110]:
fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(15,15));

axs[0][0].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Wine_PCA_Kmeans']["vmeasure"], linestyle='-', marker='o', label = "Wine PCA KM")
axs[0][0].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Wine_PCA_EM']["vmeasure"], linestyle='-',  marker='o', label = "Wine PCA EM")
axs[0][0].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Census_PCA_Kmeans']["vmeasure"], linestyle='-', marker='o', label = "Census PCA KM")
axs[0][0].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Census_PCA_EM']["vmeasure"], linestyle='-',  marker='o', label = "Census PCA EM")
axs[0][0].set_title('Reduced Dimensionality: PCA - V Measure')

axs[0][1].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Wine_ICA_Kmeans']["vmeasure"], linestyle='-', marker='o', label = "Wine ICA KM")
axs[0][1].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Wine_ICA_EM']["vmeasure"], linestyle='-',  marker='o', label = "Wine ICA EM")
axs[0][1].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Census_ICA_Kmeans']["vmeasure"], linestyle='-', marker='o', label = "Census ICA KM")
axs[0][1].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Census_ICA_EM']["vmeasure"], linestyle='-',  marker='o', label = "Census ICA EM")
axs[0][1].set_title('Reduced Dimensionality: ICA - V Measure')

axs[1][0].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Wine_RP_Kmeans']["vmeasure"], linestyle='-', marker='o', label = "Wine RP KM")
axs[1][0].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Wine_RP_EM']["vmeasure"], linestyle='-',  marker='o', label = "Wine RP EM")
axs[1][0].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Census_RP_Kmeans']["vmeasure"], linestyle='-', marker='o', label = "Census RP KM")
axs[1][0].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Census_RP_EM']["vmeasure"], linestyle='-',  marker='o', label = "Census RP EM")
axs[1][0].set_title('Reduced Dimensionality: RP - V Measure')

axs[1][1].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Wine_RF_Kmeans']["vmeasure"], linestyle='-', marker='o', label = "Wine RF KM")
axs[1][1].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Wine_RF_EM']["vmeasure"], linestyle='-',  marker='o', label = "Wine RF EM")
axs[1][1].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Census_RF_Kmeans']["vmeasure"], linestyle='-', marker='o', label = "Census RF KM")
axs[1][1].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Census_RF_EM']["vmeasure"], linestyle='-',  marker='o', label = "Census RF EM")
axs[1][1].set_title('Reduced Dimensionality: RF - V Measure')

for ax in axs.flat:
    ax.set(xlabel='# of Clusters', ylabel='Score')
    ax.legend(loc='best')
    ax.minorticks_on()
    ax.grid(b=True, which='major', color='k', linestyle='-', alpha=0.1)
    ax.grid(b=True, which='minor', color='r', linestyle='-', alpha=0.05)
fig.tight_layout()

Oberservation: Across all dimensionality reduction techniques, V-Measure score for wine dataset was lower than that for Census dataset. Highest V-Measure score for wine dataset was for FastICA model for 5 clusters. For census dataset, Gaussian mixture model implemented on the Random Forest classifier gave the highest V-Measure score of 0.16 for 5 clusters.

In [79]:
fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(15,15));

axs[0][0].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Wine_PCA_Kmeans']["exectime"], linestyle='-', marker='o', label = "Wine PCA KM")
axs[0][0].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Wine_PCA_EM']["exectime"], linestyle='-',  marker='o', label = "Wine PCA EM", alpha=0.5)
axs[0][0].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Census_PCA_Kmeans']["exectime"], linestyle='-', marker='o', label = "Census PCA KM")
axs[0][0].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Census_PCA_EM']["exectime"], linestyle='-',  marker='o', label = "Census PCA EM")
axs[0][0].set_title('Reduced Dimensionality: PCA - Execution Time')

axs[0][1].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Wine_ICA_Kmeans']["exectime"], linestyle='-', marker='o', label = "Wine ICA KM")
axs[0][1].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Wine_ICA_EM']["exectime"], linestyle='-',  marker='o', label = "Wine ICA EM", alpha=0.5)
axs[0][1].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Census_ICA_Kmeans']["exectime"], linestyle='-', marker='o', label = "Census ICA KM")
axs[0][1].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Census_ICA_EM']["exectime"], linestyle='-',  marker='o', label = "Census ICA EM")
axs[0][1].set_title('Reduced Dimensionality: ICA - Execution Time')

axs[1][0].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Wine_RP_Kmeans']["exectime"], linestyle='-', marker='o', label = "Wine RP KM")
axs[1][0].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Wine_RP_EM']["exectime"], linestyle='-',  marker='o', label = "Wine RP EM", alpha=0.5)
axs[1][0].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Census_RP_Kmeans']["exectime"], linestyle='-', marker='o', label = "Census RP KM")
axs[1][0].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Census_RP_EM']["exectime"], linestyle='-',  marker='o', label = "Census RP EM")
axs[1][0].set_title('Reduced Dimensionality: RP - Execution Time')

axs[1][1].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Wine_RF_Kmeans']["exectime"], linestyle='-', marker='o', label = "Wine RF KM")
axs[1][1].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Wine_RF_EM']["exectime"], linestyle='-',  marker='o', label = "Wine RF EM", alpha=0.5)
axs[1][1].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Census_RF_Kmeans']["exectime"], linestyle='-', marker='o', label = "Census RF KM")
axs[1][1].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Census_RF_EM']["exectime"], linestyle='-',  marker='o', label = "Census RF EM")
axs[1][1].set_title('Reduced Dimensionality: RF - Execution Time')

for ax in axs.flat:
    ax.set(xlabel='# of Clusters', ylabel='Time')
    ax.legend(loc='best')
    ax.minorticks_on()
    ax.grid(b=True, which='major', color='k', linestyle='-', alpha=0.1)
    ax.grid(b=True, which='minor', color='r', linestyle='-', alpha=0.05)
fig.tight_layout()

Oberservation: For all four dimensionality reduction techniques, overall execution time for Wine dataset was close to zero where execution time for Census dataset incresed exponentially with the increased in # of clusters. This may be due to size of the Census dataset. For Census dataset EM was computationally efficient as compared to Kmeans algorithm.

8. Neural Network on the Dimensionally Reduced Datasets

In this section dimensionality reduction algorithms are applied to census datasets and supervised neural network classifier wes re-run on the newly projected data.

In [118]:
size = 10
cv = StratifiedKFold(size, shuffle=True, random_state=1)

8.1 Neural Network using Original Census Dataset

In [90]:
Census_mlpclf =  MLPClassifier(hidden_layer_sizes=57, learning_rate_init=0.01, activation='tanh', random_state=1)
t_start = time.time()*1000.0
Census_mlpclf = Census_mlpclf.fit(Census_X_train, Census_y_train)
t_stop = time.time()*1000.0
traintime = t_stop - t_start

t_start = time.time()*1000.0
Census_y_pred = (Census_mlpclf.predict_proba(Census_X_test)[:,1] >= 0.4).astype(bool)
t_stop = time.time()*1000.0
testtime = t_stop - t_start

cnf_matrix = confusion_matrix(Census_y_test, Census_y_pred)
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plt.figure()
utils.plot_confusion_matrix(cnf_matrix, classes=[0,1], title='Original - Census')
 
modelid = "Original - Census"
prscore = precision_score(Census_y_test, Census_y_pred, average='binary')
rcscore = recall_score(Census_y_test, Census_y_pred, average='binary')
f1score = f1_score(Census_y_test, Census_y_pred, average='binary')

model_stats.append([modelid, prscore, rcscore, f1score, traintime, testtime])
print("Precision: "+str("{:.2f}".format(prscore))+", Recall: "+str("{:.2f}".format(rcscore))+", F1 Score: "+str("{:.2f}".format(f1score))+", Train Time: "+str("{:.2f}".format(traintime))+", Test Time: "+str("{:.2f}".format(testtime)))
Confusion matrix, without normalization
[[5891  878]
 [ 632 1648]]
Precision: 0.65, Recall: 0.72, F1 Score: 0.69, Train Time: 14023.03, Test Time: 28.92
In [91]:
utils.plot_pr_curve(Census_mlpclf, Census_X_test, Census_y_test, "Original - Census - PR Curve", False)
Optimal Threshold: 
0.35662597168261606
Maximized F1 score: 
0.6877628680152212

8.2 Neural Network using Reduced Census Dataset

8.2.1 PCA

In [92]:
Census_mlpclf =  MLPClassifier(hidden_layer_sizes=57, learning_rate_init=0.01, activation='tanh', random_state=1)
t_start = time.time()*1000.0
Census_mlpclf = Census_mlpclf.fit(Census_X_train_PCA, Census_y_train)
t_stop = time.time()*1000.0
traintime = t_stop - t_start

t_start = time.time()*1000.0
Census_y_pred = (Census_mlpclf.predict_proba(Census_X_test_PCA)[:,1] >= 0.4).astype(bool)
t_stop = time.time()*1000.0
testtime = t_stop - t_start

cnf_matrix = confusion_matrix(Census_y_test, Census_y_pred)
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plt.figure()
utils.plot_confusion_matrix(cnf_matrix, classes=[0,1], title='PCA - Census')
 
modelid = "PCA - Census"
prscore = precision_score(Census_y_test, Census_y_pred, average='binary')
rcscore = recall_score(Census_y_test, Census_y_pred, average='binary')
f1score = f1_score(Census_y_test, Census_y_pred, average='binary')

model_stats.append([modelid, prscore, rcscore, f1score, traintime, testtime])
print("Precision: "+str("{:.2f}".format(prscore))+", Recall: "+str("{:.2f}".format(rcscore))+", F1 Score: "+str("{:.2f}".format(f1score))+", Train Time: "+str("{:.2f}".format(traintime))+", Test Time: "+str("{:.2f}".format(testtime)))
Confusion matrix, without normalization
[[5788  981]
 [ 578 1702]]
Precision: 0.63, Recall: 0.75, F1 Score: 0.69, Train Time: 9733.83, Test Time: 25.93
In [93]:
utils.plot_pr_curve(Census_mlpclf, Census_X_test_PCA, Census_y_test, "PCA - Census - PR Curve", False)
Optimal Threshold: 
0.4530754337179637
Maximized F1 score: 
0.6885593220338982

8.2.2 ICA

In [94]:
Census_mlpclf =  MLPClassifier(hidden_layer_sizes=57, learning_rate_init=0.01, activation='tanh', random_state=1)
t_start = time.time()*1000.0
Census_mlpclf = Census_mlpclf.fit(Census_X_train_ICA, Census_y_train)
t_stop = time.time()*1000.0
traintime = t_stop - t_start

t_start = time.time()*1000.0
Census_y_pred = (Census_mlpclf.predict_proba(Census_X_test_ICA)[:,1] >= 0.4).astype(bool)
t_stop = time.time()*1000.0
testtime = t_stop - t_start

cnf_matrix = confusion_matrix(Census_y_test, Census_y_pred)
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plt.figure()
utils.plot_confusion_matrix(cnf_matrix, classes=[0,1], title='ICA - Census')
 
modelid = "ICA - Census"
prscore = precision_score(Census_y_test, Census_y_pred, average='binary')
rcscore = recall_score(Census_y_test, Census_y_pred, average='binary')
f1score = f1_score(Census_y_test, Census_y_pred, average='binary')

model_stats.append([modelid, prscore, rcscore, f1score, traintime, testtime])
print("Precision: "+str("{:.2f}".format(prscore))+", Recall: "+str("{:.2f}".format(rcscore))+", F1 Score: "+str("{:.2f}".format(f1score))+", Train Time: "+str("{:.2f}".format(traintime))+", Test Time: "+str("{:.2f}".format(testtime)))
Confusion matrix, without normalization
[[5926  843]
 [1031 1249]]
Precision: 0.60, Recall: 0.55, F1 Score: 0.57, Train Time: 1596.24, Test Time: 7.98
In [95]:
utils.plot_pr_curve(Census_mlpclf, Census_X_test_ICA, Census_y_test, "ICA - Census - PR Curve", False)
Optimal Threshold: 
0.2939128383505911
Maximized F1 score: 
0.6183895338124193

8.2.3 RP

In [96]:
Census_mlpclf =  MLPClassifier(hidden_layer_sizes=57, learning_rate_init=0.01, activation='tanh', random_state=1)
t_start = time.time()*1000.0
Census_mlpclf = Census_mlpclf.fit(Census_X_train_RP, Census_y_train)
t_stop = time.time()*1000.0
traintime = t_stop - t_start

t_start = time.time()*1000.0
Census_y_pred = (Census_mlpclf.predict_proba(Census_X_test_RP)[:,1] >= 0.4).astype(bool)
t_stop = time.time()*1000.0
testtime = t_stop - t_start

cnf_matrix = confusion_matrix(Census_y_test, Census_y_pred)
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plt.figure()
utils.plot_confusion_matrix(cnf_matrix, classes=[0,1], title='RP - Census')
 
modelid = "RP - Census"
prscore = precision_score(Census_y_test, Census_y_pred, average='binary')
rcscore = recall_score(Census_y_test, Census_y_pred, average='binary')
f1score = f1_score(Census_y_test, Census_y_pred, average='binary')

model_stats.append([modelid, prscore, rcscore, f1score, traintime, testtime])
print("Precision: "+str("{:.2f}".format(prscore))+", Recall: "+str("{:.2f}".format(rcscore))+", F1 Score: "+str("{:.2f}".format(f1score))+", Train Time: "+str("{:.2f}".format(traintime))+", Test Time: "+str("{:.2f}".format(testtime)))
Confusion matrix, without normalization
[[5804  965]
 [ 707 1573]]
Precision: 0.62, Recall: 0.69, F1 Score: 0.65, Train Time: 9643.64, Test Time: 17.95
In [97]:
utils.plot_pr_curve(Census_mlpclf, Census_X_test_RP, Census_y_test, "RP - Census - PR Curve", False)
Optimal Threshold: 
0.34308014417880744
Maximized F1 score: 
0.6555810397553518

8.2.4 RF

In [98]:
Census_mlpclf =  MLPClassifier(hidden_layer_sizes=57, learning_rate_init=0.01, activation='tanh', random_state=1)
t_start = time.time()*1000.0
Census_mlpclf = Census_mlpclf.fit(Census_X_train_RF, Census_y_train)
t_stop = time.time()*1000.0
traintime = t_stop - t_start

t_start = time.time()*1000.0
Census_y_pred = (Census_mlpclf.predict_proba(Census_X_test_RF)[:,1] >= 0.4).astype(bool)
t_stop = time.time()*1000.0
testtime = t_stop - t_start

cnf_matrix = confusion_matrix(Census_y_test, Census_y_pred)
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plt.figure()
utils.plot_confusion_matrix(cnf_matrix, classes=[0,1], title='RF - Census')
 
modelid = "RF - Census"
prscore = precision_score(Census_y_test, Census_y_pred, average='binary')
rcscore = recall_score(Census_y_test, Census_y_pred, average='binary')
f1score = f1_score(Census_y_test, Census_y_pred, average='binary')

model_stats.append([modelid, prscore, rcscore, f1score, traintime, testtime])
print("Precision: "+str("{:.2f}".format(prscore))+", Recall: "+str("{:.2f}".format(rcscore))+", F1 Score: "+str("{:.2f}".format(f1score))+", Train Time: "+str("{:.2f}".format(traintime))+", Test Time: "+str("{:.2f}".format(testtime)))
Confusion matrix, without normalization
[[5991  778]
 [ 697 1583]]
Precision: 0.67, Recall: 0.69, F1 Score: 0.68, Train Time: 5856.28, Test Time: 19.95
In [99]:
utils.plot_pr_curve(Census_mlpclf, Census_X_test_RF, Census_y_test, "RF - Census - PR Curve", False)
Optimal Threshold: 
0.34716751079080427
Maximized F1 score: 
0.6886792452830188

Observation: Analysis of four dimensionality reduction algorithms reveals that performance neural network algorithm on the datasets reduced using PCA and Random Forest slightly out performances the neural network on the original dataset. Performance of both dimensionality reduction techniques is very close. Performance on the dataset reduced using ICA was worst and much lower than performance on the original dataset.

9. Neural Network on the Clustered Census Datasets

In this section neural network was implemented on the clustered dataset. Clustered census dataset represents each cluster as a dataset feature which leads to reduced dimensions. Both clustering algorithms will be find implemented to identify optimal value of K. Then Neural network algorithm will be tuned to optimize performance.

9.1. Identify Optimal Value of K

In [119]:
model_stats = []
kclusters = range(2, 15)

for k in kclusters:
    model = KMeans(n_clusters=k)
    model.fit(Census_X_train)
    Census_X_train_kmeans = model.transform(Census_X_train)
    Census_X_test_kmeans = model.transform(Census_X_test)

    Census_mlpclf =  MLPClassifier(hidden_layer_sizes=57, learning_rate_init=0.01, activation='tanh', random_state=1)
    t_start = time.time()*1000.0
    Census_mlpclf = Census_mlpclf.fit(Census_X_train_kmeans, Census_y_train)
    t_stop = time.time()*1000.0
    traintime = t_stop - t_start

    t_start = time.time()*1000.0
    Census_y_pred = (Census_mlpclf.predict_proba(Census_X_test_kmeans)[:,1] >= 0.4).astype(bool)
    t_stop = time.time()*1000.0
    testtime = t_stop - t_start

    cnf_matrix = confusion_matrix(Census_y_test, Census_y_pred)
    np.set_printoptions(precision=2)

    modelid = "Kmeans - Census"
    prscore = precision_score(Census_y_test, Census_y_pred, average='binary')
    rcscore = recall_score(Census_y_test, Census_y_pred, average='binary')
    f1score = f1_score(Census_y_test, Census_y_pred, average='binary')

    model_stats.append([k, modelid, prscore, rcscore, f1score, traintime, testtime])
    print("K: "+str(k)+", Precision: "+str("{:.2f}".format(prscore))+", Recall: "+str("{:.2f}".format(rcscore))+", F1 Score: "+str("{:.2f}".format(f1score))+", Train Time: "+str("{:.2f}".format(traintime))+", Test Time: "+str("{:.2f}".format(testtime)))
K: 2, Precision: 0.55, Recall: 0.61, F1 Score: 0.58, Train Time: 4733.23, Test Time: 7.98
K: 3, Precision: 0.62, Recall: 0.44, F1 Score: 0.51, Train Time: 3932.65, Test Time: 11.97
K: 4, Precision: 0.62, Recall: 0.60, F1 Score: 0.61, Train Time: 2089.07, Test Time: 11.97
K: 5, Precision: 0.57, Recall: 0.71, F1 Score: 0.63, Train Time: 4863.37, Test Time: 13.96
K: 6, Precision: 0.62, Recall: 0.62, F1 Score: 0.62, Train Time: 4884.00, Test Time: 13.96
K: 7, Precision: 0.56, Recall: 0.83, F1 Score: 0.66, Train Time: 5773.33, Test Time: 24.93
K: 8, Precision: 0.69, Recall: 0.57, F1 Score: 0.62, Train Time: 6675.99, Test Time: 14.96
K: 9, Precision: 0.68, Recall: 0.55, F1 Score: 0.61, Train Time: 2396.64, Test Time: 13.96
K: 10, Precision: 0.64, Recall: 0.68, F1 Score: 0.66, Train Time: 6636.84, Test Time: 12.97
K: 11, Precision: 0.77, Recall: 0.39, F1 Score: 0.52, Train Time: 3082.36, Test Time: 11.97
K: 12, Precision: 0.62, Recall: 0.72, F1 Score: 0.66, Train Time: 4115.58, Test Time: 13.96
K: 13, Precision: 0.64, Recall: 0.71, F1 Score: 0.67, Train Time: 8229.54, Test Time: 17.95
K: 14, Precision: 0.65, Recall: 0.66, F1 Score: 0.66, Train Time: 4945.67, Test Time: 16.95
In [113]:
model_stats = []
kclusters = range(2, 15)

for k in kclusters:
    model = GaussianMixture(n_components=k, covariance_type='diag')
    model.fit(Census_X_train)
    Census_X_train_EM = model.predict_proba(Census_X_train)
    Census_X_test_EM = model.predict_proba(Census_X_test)

    Census_mlpclf =  MLPClassifier(hidden_layer_sizes=57, learning_rate_init=0.01, activation='tanh', random_state=1)
    t_start = time.time()*1000.0
    Census_mlpclf = Census_mlpclf.fit(Census_X_train_EM, Census_y_train)
    t_stop = time.time()*1000.0
    traintime = t_stop - t_start

    t_start = time.time()*1000.0
    Census_y_pred = (Census_mlpclf.predict_proba(Census_X_test_EM)[:,1] >= 0.4).astype(bool)
    t_stop = time.time()*1000.0
    testtime = t_stop - t_start

    cnf_matrix = confusion_matrix(Census_y_test, Census_y_pred)
    np.set_printoptions(precision=2)

    # Plot non-normalized confusion matrix
#     plt.figure()
#     utils.plot_confusion_matrix(cnf_matrix, classes=[0,1], title='Kmeans - Census')

    modelid = "EM - Census"
    prscore = precision_score(Census_y_test, Census_y_pred, average='binary')
    rcscore = recall_score(Census_y_test, Census_y_pred, average='binary')
    f1score = f1_score(Census_y_test, Census_y_pred, average='binary')

    model_stats.append([k, modelid, prscore, rcscore, f1score, traintime, testtime])

    
    print("K: "+str(k)+", Precision: "+str("{:.2f}".format(prscore))+", Recall: "+str("{:.2f}".format(rcscore))+", F1 Score: "+str("{:.2f}".format(f1score))+", Train Time: "+str("{:.2f}".format(traintime))+", Test Time: "+str("{:.2f}".format(testtime)))
C:\Users\abhijit\Anaconda3\lib\site-packages\sklearn\metrics\_classification.py:1272: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
K: 2, Precision: 0.00, Recall: 0.00, F1 Score: 0.00, Train Time: 1969.73, Test Time: 5.98
K: 3, Precision: 0.48, Recall: 0.67, F1 Score: 0.56, Train Time: 780.91, Test Time: 5.98
K: 4, Precision: 0.56, Recall: 0.18, F1 Score: 0.27, Train Time: 1441.15, Test Time: 5.98
K: 5, Precision: 0.49, Recall: 0.68, F1 Score: 0.57, Train Time: 1070.67, Test Time: 5.98
K: 6, Precision: 0.45, Recall: 0.82, F1 Score: 0.58, Train Time: 1233.84, Test Time: 7.98
K: 7, Precision: 0.49, Recall: 0.74, F1 Score: 0.59, Train Time: 1371.86, Test Time: 5.98
K: 8, Precision: 0.46, Recall: 0.79, F1 Score: 0.58, Train Time: 1806.69, Test Time: 6.98
K: 9, Precision: 0.44, Recall: 0.59, F1 Score: 0.51, Train Time: 3226.08, Test Time: 6.98
K: 10, Precision: 0.48, Recall: 0.72, F1 Score: 0.57, Train Time: 951.45, Test Time: 6.98
K: 11, Precision: 0.48, Recall: 0.72, F1 Score: 0.57, Train Time: 1227.72, Test Time: 6.98
K: 12, Precision: 0.60, Recall: 0.29, F1 Score: 0.39, Train Time: 747.00, Test Time: 5.98
K: 13, Precision: 0.48, Recall: 0.61, F1 Score: 0.54, Train Time: 1372.37, Test Time: 7.98
K: 14, Precision: 0.66, Recall: 0.23, F1 Score: 0.34, Train Time: 1421.20, Test Time: 7.98

Observations: Both K-means and GMM algorithms are executed for multiple times to measure Precision, Recall, F1 Score, Train Time and Test Time. Based on highest F1 score optimal value of clusters was chosen. For K-means algorithm is optimal cluster count was 8. The optimal numbers of components for Gaussian Mixture Model algorithm was 7. It was observed that for K-means model multiple Ks had highest F1 Score, cluster count of 8 was chosen because it has lowest Train and Test Time.

9.2 K-means

9.2.1 Neural Network Tuning

Some of neural network parameters tuned during implementation of supervised learning algorithm were re-tuned during this step. Optimal value of activation was identified as logistic and that for learning rate was 0.003. Optimization was carried out using GridSearchCV method with 10-fold stratified cross validation.

In [127]:
model = KMeans(n_clusters=8)
model.fit(Census_X_train)
Census_X_train_kmeans = model.transform(Census_X_train)
Census_X_test_kmeans = model.transform(Census_X_test)
Census_mlpclf =  MLPClassifier(hidden_layer_sizes=57, learning_rate_init=0.01, activation='tanh', random_state=1)
Census_mlpclf = Census_mlpclf.fit(Census_X_train_kmeans, Census_y_train)

# Census_mlpclf =  MLPClassifier(hidden_layer_sizes=57, learning_rate_init=0.01, activation='tanh', random_state=1)
Census_mlpclf =  MLPClassifier(hidden_layer_sizes=57, random_state=1)

activation =['identity', 'logistic', 'tanh', 'relu']
learning_rates = np.arange(0.0, 0.005, 0.0005)
param_grid = {'learning_rate_init': learning_rates, 'activation': activation}

Census_mlpgs= GridSearchCV(Census_mlpclf,param_grid, cv=cv, scoring='f1', n_jobs=-1)
Census_mlpgs.fit(Census_X_train_kmeans,Census_y_train)
Out[127]:
GridSearchCV(cv=StratifiedKFold(n_splits=10, random_state=1, shuffle=True),
             error_score=nan,
             estimator=MLPClassifier(activation='relu', alpha=0.0001,
                                     batch_size='auto', beta_1=0.9,
                                     beta_2=0.999, early_stopping=False,
                                     epsilon=1e-08, hidden_layer_sizes=57,
                                     learning_rate='constant',
                                     learning_rate_init=0.001, max_fun=15000,
                                     max_iter=200, momentum=0.9,
                                     n_iter_no_change=10,
                                     nesterovs_momentum=True, power_t=0.5,
                                     random_state=1, shuffle=True,
                                     solver='adam', tol=0.0001,
                                     validation_fraction=0.1, verbose=False,
                                     warm_start=False),
             iid='deprecated', n_jobs=-1,
             param_grid={'activation': ['identity', 'logistic', 'tanh', 'relu'],
                         'learning_rate_init': array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='f1', verbose=0)
In [128]:
print("Best Score:" + str(Census_mlpgs.best_score_))
print("Best Parameters: " + str(Census_mlpgs.best_params_))
print("Best Estimator: " + str(Census_mlpgs.best_estimator_))
Best Score:0.5966351610353527
Best Parameters: {'activation': 'logistic', 'learning_rate_init': 0.004}
Best Estimator: MLPClassifier(activation='logistic', alpha=0.0001, batch_size='auto',
              beta_1=0.9, beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=57, learning_rate='constant',
              learning_rate_init=0.004, max_fun=15000, max_iter=200,
              momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
              power_t=0.5, random_state=1, shuffle=True, solver='adam',
              tol=0.0001, validation_fraction=0.1, verbose=False,
              warm_start=False)
In [129]:
Census_mlpclf = MLPClassifier(activation='logistic', alpha=0.0001, batch_size='auto',
              beta_1=0.9, beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=57, learning_rate='constant',
              learning_rate_init=0.004, max_fun=15000, max_iter=200,
              momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
              power_t=0.5, random_state=1, shuffle=True, solver='adam',
              tol=0.0001, validation_fraction=0.1, verbose=False,
              warm_start=False)

#plot learning curve
utils.plot_learning_curve(Census_mlpclf, "MLP - Census - Learning Curve", Census_X_train_kmeans, Census_y_train, cv=cv, n_jobs=-1, exp_score=0.7, train_sizes=np.linspace(0.05, 1.0, 10), ylim=[0.5,1])

9.2.2 Neural Network Performance on Test Set

In [130]:
t_start = time.time()*1000.0
Census_mlpclf = Census_mlpclf.fit(Census_X_train_kmeans, Census_y_train)
t_stop = time.time()*1000.0
traintime = t_stop - t_start

t_start = time.time()*1000.0
Census_y_pred = (Census_mlpclf.predict_proba(Census_X_test_kmeans)[:,1] >= 0.4).astype(bool)
t_stop = time.time()*1000.0
testtime = t_stop - t_start

cnf_matrix = confusion_matrix(Census_y_test, Census_y_pred)
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plt.figure()
utils.plot_confusion_matrix(cnf_matrix, classes=[0,1], title='Kmeans - Census')

modelid = "Kmeans - Census"
prscore = precision_score(Census_y_test, Census_y_pred, average='binary')
rcscore = recall_score(Census_y_test, Census_y_pred, average='binary')
f1score = f1_score(Census_y_test, Census_y_pred, average='binary')

model_stats.append([k, modelid, prscore, rcscore, f1score, traintime, testtime])
print("K: "+str(k)+", Precision: "+str("{:.2f}".format(prscore))+", Recall: "+str("{:.2f}".format(rcscore))+", F1 Score: "+str("{:.2f}".format(f1score))+", Train Time: "+str("{:.2f}".format(traintime))+", Test Time: "+str("{:.2f}".format(testtime)))



utils.plot_pr_curve(Census_mlpclf, Census_X_test_kmeans, Census_y_test, "Kmeans - Census - PR Curve", False)
Confusion matrix, without normalization
[[6254  515]
 [1054 1226]]
K: 14, Precision: 0.70, Recall: 0.54, F1 Score: 0.61, Train Time: 7111.29, Test Time: 7.98
Optimal Threshold: 
0.22768185733371651
Maximized F1 score: 
0.6646176098354727
In [ ]:
 

9.3 Expectation Maximization

9.3.1 Neural Network Tuning

Some of neural network parameters tuned during implementation of supervised learning algorithm were re-tuned during this step. Optimal value of activation was identified as logistic and that for learning rate was 0.003. Optimization was carried out using GridSearchCV method with 10-fold stratified cross validation.

In [122]:
model = GaussianMixture(n_components=7, covariance_type='diag')
model.fit(Census_X_train)
Census_X_train_EM = model.predict_proba(Census_X_train)
Census_X_test_EM = model.predict_proba(Census_X_test)

# Census_mlpclf =  MLPClassifier(hidden_layer_sizes=57, learning_rate_init=0.01, activation='tanh', random_state=1)
Census_mlpclf =  MLPClassifier(hidden_layer_sizes=57, random_state=1)

activation =['identity', 'logistic', 'tanh', 'relu']
learning_rates = np.arange(0.0, 0.005, 0.0005)
param_grid = {'learning_rate_init': learning_rates, 'activation': activation}

Census_mlpgs= GridSearchCV(Census_mlpclf,param_grid, cv=cv, scoring='f1', n_jobs=-1)
Census_mlpgs.fit(Census_X_train_EM,Census_y_train)
Out[122]:
GridSearchCV(cv=StratifiedKFold(n_splits=10, random_state=1, shuffle=True),
             error_score=nan,
             estimator=MLPClassifier(activation='relu', alpha=0.0001,
                                     batch_size='auto', beta_1=0.9,
                                     beta_2=0.999, early_stopping=False,
                                     epsilon=1e-08, hidden_layer_sizes=57,
                                     learning_rate='constant',
                                     learning_rate_init=0.001, max_fun=15000,
                                     max_iter=200, momentum=0.9,
                                     n_iter_no_change=10,
                                     nesterovs_momentum=True, power_t=0.5,
                                     random_state=1, shuffle=True,
                                     solver='adam', tol=0.0001,
                                     validation_fraction=0.1, verbose=False,
                                     warm_start=False),
             iid='deprecated', n_jobs=-1,
             param_grid={'activation': ['identity', 'logistic', 'tanh', 'relu'],
                         'learning_rate_init': array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='f1', verbose=0)
In [124]:
print("Best Score:" + str(Census_mlpgs.best_score_))
print("Best Parameters: " + str(Census_mlpgs.best_params_))
print("Best Estimator: " + str(Census_mlpgs.best_estimator_))
Best Score:0.42871977717404725
Best Parameters: {'activation': 'logistic', 'learning_rate_init': 0.003}
Best Estimator: MLPClassifier(activation='logistic', alpha=0.0001, batch_size='auto',
              beta_1=0.9, beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=57, learning_rate='constant',
              learning_rate_init=0.003, max_fun=15000, max_iter=200,
              momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
              power_t=0.5, random_state=1, shuffle=True, solver='adam',
              tol=0.0001, validation_fraction=0.1, verbose=False,
              warm_start=False)
In [132]:
Census_mlpclf = MLPClassifier(activation='logistic', alpha=0.0001, batch_size='auto',
              beta_1=0.9, beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=57, learning_rate='constant',
              learning_rate_init=0.003, max_fun=15000, max_iter=200,
              momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
              power_t=0.5, random_state=1, shuffle=True, solver='adam',
              tol=0.0001, validation_fraction=0.1, verbose=False,
              warm_start=False)

#plot learning curve
utils.plot_learning_curve(Census_mlpclf, "KNN (Itr 1) - Census - Learning Curve", Census_X_train_EM, Census_y_train, cv=cv, n_jobs=-1, exp_score=0.7, train_sizes=np.linspace(0.05, 1.0, 10), ylim=[0.1,1])

9.3.2 Neural Network Performance on Test Set

In [133]:
Census_mlpclf = Census_mlpclf.fit(Census_X_train_EM, Census_y_train)


t_start = time.time()*1000.0
Census_mlpclf = Census_mlpclf.fit(Census_X_train_EM, Census_y_train)
t_stop = time.time()*1000.0
traintime = t_stop - t_start

t_start = time.time()*1000.0
Census_y_pred = (Census_mlpclf.predict_proba(Census_X_test_EM)[:,1] >= 0.4).astype(bool)
t_stop = time.time()*1000.0
testtime = t_stop - t_start

cnf_matrix = confusion_matrix(Census_y_test, Census_y_pred)
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plt.figure()
utils.plot_confusion_matrix(cnf_matrix, classes=[0,1], title='EM - Census')

modelid = "EM - Census"
prscore = precision_score(Census_y_test, Census_y_pred, average='binary')
rcscore = recall_score(Census_y_test, Census_y_pred, average='binary')
f1score = f1_score(Census_y_test, Census_y_pred, average='binary')

model_stats.append([k, modelid, prscore, rcscore, f1score, traintime, testtime])
print("K: "+str(k)+", Precision: "+str("{:.2f}".format(prscore))+", Recall: "+str("{:.2f}".format(rcscore))+", F1 Score: "+str("{:.2f}".format(f1score))+", Train Time: "+str("{:.2f}".format(traintime))+", Test Time: "+str("{:.2f}".format(testtime)))

utils.plot_pr_curve(Census_mlpclf, Census_X_test_EM, Census_y_test, "EM - Census - PR Curve", False)
Confusion matrix, without normalization
[[6055  714]
 [1485  795]]
K: 14, Precision: 0.53, Recall: 0.35, F1 Score: 0.42, Train Time: 1683.51, Test Time: 12.97
Optimal Threshold: 
0.366104308445534
Maximized F1 score: 
0.5941365870191493

Observation: Neural Network trained on the dataset clustered using Gaussian Mixture Model performed very poorly as compared performance of the neural network on the original dataset. Test and Train time on this clustered dataset was much lesser than that on the original dataset. This reduction in the execution time is due to reduced number of features.

10. Conclusion

Looking at the performance of the clustering and dimensionality reduction techniques, it can be concluded that for selected datasets dimensionality reduction techniques performed better as compared to clustering methods. But when compared to original dataset, performance on dimensionally reduced dataset was not substantially better. One of the reasons for this could be the structure of the dataset. Both of the chosen datasets had less than 15 attributes. Although performance improvement was not substantial for PCA and Random Forest but there was substantial difference in the training and testing time. Hence dimensionality reduction techniques should not only be looked at for performance improvement but they can help in reducing execution time.